Merge pull request #2597 from bjlittle/dask-merge-back
Dask merge back
corinnebosley authored Jun 14, 2017
2 parents 2b9c2aa + e2eeea3 commit 6bd26b7
Showing 776 changed files with 12,220 additions and 5,695 deletions.
4 changes: 2 additions & 2 deletions .travis.yml
@@ -22,7 +22,7 @@ git:
depth: 10000

install:
- export IRIS_TEST_DATA_REF="7c0e32c8812b464e467a9555bdc25dc1e0c5be0c"
- export IRIS_TEST_DATA_REF="2f3a6bcf25f81bd152b3d66223394074c9069a96"
- export IRIS_TEST_DATA_SUFFIX=$(echo "${IRIS_TEST_DATA_REF}" | sed "s/^v//")

# Install miniconda
@@ -54,7 +54,7 @@ install:
conda install --quiet --file minimal-conda-requirements.txt;
else
if [[ "$TRAVIS_PYTHON_VERSION" == 3* ]]; then
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e '/iris-grib/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
else
conda install --quiet --file conda-requirements.txt;
fi
7 changes: 0 additions & 7 deletions INSTALL
@@ -80,9 +80,6 @@ numpy 1.9 or later (http://numpy.scipy.org/)
Python package for scientific computing including a powerful N-dimensional
array object.

biggus 0.14 or later (https://github.com/SciTools/biggus)
Virtual large arrays and lazy evaluation.

scipy 0.10 or later (http://www.scipy.org/)
Python package for scientific computing.

@@ -128,10 +125,6 @@ grib-api 1.9.16 or later
edition 2 messages. A compression library such as Jasper is required
to read JPEG2000 compressed GRIB2 files.

iris-grib 0.9 or later
(https://github.com/scitools/iris-grib)
Iris interface to ECMWF's GRIB API

matplotlib 1.2.0 (http://matplotlib.sourceforge.net/)
Python package for 2D plotting.

10 changes: 5 additions & 5 deletions conda-requirements.txt
@@ -2,14 +2,14 @@
# conda create -n <name> --file conda-requirements.txt

# Mandatory dependencies
biggus
cartopy
matplotlib<1.9
netcdf4
numpy
pyke
udunits2
cf_units
dask

# Iris build dependencies
setuptools
@@ -25,12 +25,12 @@ imagehash
requests

# Optional iris dependencies
nc_time_axis
iris-grib
ecmwf_grib
esmpy>=7.0
gdal
libmo_unpack
pandas
pyugrid
mo_pack
nc_time_axis
pandas
python-stratify
pyugrid
2 changes: 0 additions & 2 deletions docs/iris/src/conf.py
@@ -158,8 +158,6 @@
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
'matplotlib': ('http://matplotlib.org/', None),
'cartopy': ('http://scitools.org.uk/cartopy/docs/latest/', None),
'biggus': ('https://biggus.readthedocs.io/en/latest/', None),
'iris-grib': ('http://iris-grib.readthedocs.io/en/latest/', None),
}


35 changes: 35 additions & 0 deletions docs/iris/src/developers_guide/dask_interface.rst
@@ -0,0 +1,35 @@
Iris Dask Interface
*******************

Iris uses `dask <http://dask.pydata.org>`_ to manage lazy data interfaces and processing graphs.
The key principles that define this interface are:

* A call to :attr:`cube.data` will always load all of the data.

* Once this has happened:

* :attr:`cube.data` is a mutable NumPy masked array or ``ndarray``, and
* ``cube._numpy_array`` is a private NumPy masked array, accessible via :attr:`cube.data`, which may strip off the mask and return a reference to the bare ``ndarray``.

* You can use :attr:`cube.data` to set the data. This accepts:

* a NumPy array (including masked array), which is assigned to ``cube._numpy_array``, or
* a dask array, which is assigned to ``cube._dask_array``, while ``cube._numpy_array`` is set to ``None``.

* ``cube._dask_array`` may be ``None``; otherwise it is expected to be a dask array:

* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.

* In order to support this mask conversion, cubes have a ``fill_value`` defined as part of their metadata, which may be ``None``.

* Array copying is kept to an absolute minimum:

* array references should always be passed, not new arrays created, unless an explicit copy operation is requested.

* To test for the presence of a dask array of any sort, we use :func:`iris._lazy_data.is_lazy_data`. This is implemented as ``hasattr(data, 'compute')``.
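
As an illustrative sketch of the two conventions above (the real helper lives in
``iris._lazy_data``, and Iris's exact mask handling may differ from this assumed
``masked_invalid`` round-trip)::

    import dask.array as da
    import numpy as np

    def is_lazy_data(data):
        # Anything with a 'compute' method is treated as lazy.
        return hasattr(data, 'compute')

    # Missing points are held as nan inside the dask array; integer data
    # would first be cast to float so that it can carry nans.
    points = np.array([1.0, np.nan, 3.0])
    lazy = da.from_array(points, chunks=points.shape)

    print(is_lazy_data(lazy))    # True
    print(is_lazy_data(points))  # False

    # Realising the data rebuilds the mask from the nan locations.
    real = np.ma.masked_invalid(lazy.compute())
    print(real.mask)             # [False  True False]
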
1 change: 1 addition & 0 deletions docs/iris/src/developers_guide/index.rst
@@ -38,3 +38,4 @@
tests.rst
deprecations.rst
release.rst
dask_interface.rst
7 changes: 4 additions & 3 deletions docs/iris/src/userguide/index.rst
@@ -6,11 +6,11 @@ Iris user guide

How to use the user guide
---------------------------
If you are reading this user guide for the first time, it is strongly recommended that you read the user guide
fully before experimenting with your own data files.


Much of the content has supplementary links to the reference documentation; you will not need to follow these
links in order to understand the guide but they may serve as a useful reference for future exploration.

.. htmlonly::
@@ -30,6 +30,7 @@ User guide table of contents
saving_iris_cubes.rst
navigating_a_cube.rst
subsetting_a_cube.rst
real_and_lazy_data.rst
plotting_a_cube.rst
interpolation_and_regridding.rst
merge_and_concat.rst
4 changes: 2 additions & 2 deletions docs/iris/src/userguide/interpolation_and_regridding.rst
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
>>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
>>> new_column = column.interpolate(sample_points, scheme)
>>> print(new_column.coord('altitude').points)
[ nan 494.44451904 588.88891602 683.33325195 777.77783203
872.222229 966.66674805 1061.11108398 1155.55541992 nan]
[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
872.2222290039062 966.666748046875 1061.111083984375 1155.555419921875 --]


.. _caching_an_interpolator:
230 changes: 230 additions & 0 deletions docs/iris/src/userguide/real_and_lazy_data.rst
@@ -0,0 +1,230 @@
.. _real_and_lazy_data:


.. testsetup:: *

import dask.array as da
import iris
import numpy as np


==================
Real and Lazy Data
==================

We have seen in the :doc:`user_guide_introduction` section of the user guide that
Iris cubes contain data and metadata about a phenomenon. The data element of a cube
is always an array, but the array may be either "real" or "lazy".

In this section of the user guide we will look specifically at the concepts of
real and lazy data as they apply to the cube and other data structures in Iris.


What is real and lazy data?
---------------------------

In Iris, we use the term **real data** to describe data arrays that are loaded
into memory. Real data is typically provided as a
`NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_,
which has a shape and data type that are used to describe the array's data points.
Each data point takes up a small amount of memory, which means large NumPy arrays can
take up a large amount of memory.

Conversely, we use the term **lazy data** to describe data that is not loaded into memory.
(This is sometimes also referred to as **deferred data**.)
In Iris, lazy data is provided as a
`dask array <http://dask.pydata.org/en/latest/array-overview.html>`_.
A dask array also has a shape and data type,
but typically the dask array's data points are not loaded into memory.
Instead, the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary (see the section :ref:`when_real_data`
for examples of when this might happen).

The primary advantage of using lazy data is that it enables
`out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_;
that is, the loading and manipulating of datasets that otherwise would not fit into memory.
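
For illustration, the distinction can be seen with a bare dask array built from
a NumPy array (a standalone sketch using dask directly, rather than Iris's
file-backed loading)::

    import dask.array as da
    import numpy as np

    real_array = np.arange(12.0).reshape(3, 4)       # values held in memory
    lazy_array = da.from_array(real_array, chunks=(3, 2))

    # Both arrays know their shape and dtype ...
    print(lazy_array.shape, lazy_array.dtype)        # (3, 4) float64

    # ... but the lazy array's values are only computed on request.
    print(lazy_array.sum().compute())                # 66.0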

You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
# Realise the lazy data.
>>> cube.data
>>> cube.has_lazy_data()
False


.. _when_real_data:

When does my data become real?
------------------------------

When you load a dataset using Iris, the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
as well as some operations that will maintain lazy data. We use the term **realise**
to mean converting lazy data into real data.

Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
remain lazy until you request the result (such as when you call ``cube.data``)::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> cube += 5
>>> cube.has_lazy_data()
True

The process by which the operation is deferred until the result is requested is
referred to as **lazy evaluation**.
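
The deferral itself can be demonstrated with a bare dask array (a minimal
sketch, independent of Iris)::

    import dask.array as da

    arr = da.arange(8, chunks=4)
    result = arr + 5           # deferred: no values are computed yet
    print(type(result))        # <class 'dask.array.core.Array'>
    print(result.compute())    # [ 5  6  7  8  9 10 11 12]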

Certain operations, including regridding and plotting, can only be run on real data.
Calling such operations on lazy data will automatically realise your lazy data.

You can also realise (and so load into memory) your cube's lazy data if you 'touch' the data.
To 'touch' the data means directly accessing the data by calling ``cube.data``,
as in the previous example.

Core data
^^^^^^^^^

Cubes have the concept of "core data": a cube's :meth:`~iris.cube.Cube.core_data` method
returns the cube's data in its current state:

* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.

For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True

>>> the_data = cube.core_data()
>>> type(the_data)
<class 'dask.array.core.Array'>
>>> cube.has_lazy_data()
True

# Realise the lazy data.
>>> cube.data
>>> the_data = cube.core_data()
>>> type(the_data)
<type 'numpy.ndarray'>
>>> cube.has_lazy_data()
False


Coordinates
-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain a
points array and an optional bounds array.
Coordinate points and bounds arrays can also be real or lazy:

* A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds
arrays because of monotonicity checks that realise lazy arrays.
* An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds.
* An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate)
can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate
have real points and bounds then the derived coordinate will have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.

Iris cubes and coordinates have very similar interfaces, and this extends to accessing
coordinates' lazy points and bounds:

.. doctest::

>>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc'))

>>> dim_coord = cube.coord('model_level_number')
>>> print(dim_coord.has_lazy_points())
False
>>> print(dim_coord.has_bounds())
False
>>> print(dim_coord.has_lazy_bounds())
False

>>> aux_coord = cube.coord('sigma')
>>> print(aux_coord.has_lazy_points())
True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())
True

# Realise the lazy points. This will **not** realise the lazy bounds.
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
>>> print(aux_coord.has_lazy_bounds())
True

>>> derived_coord = cube.coord('altitude')
>>> print(derived_coord.has_lazy_points())
True
>>> print(derived_coord.has_bounds())
True
>>> print(derived_coord.has_lazy_bounds())
True

.. note::
Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays!


Dask processing options
-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
functionality for processing deferred operations on lazy arrays.

Dask provides processing options to control how deferred operations on lazy arrays
are computed. This is provided via the ``dask.set_options`` interface.
We can make use of this functionality in Iris to control how the dask arrays
in Iris are processed, for example giving us the power to run Iris processing
in parallel.

Iris by default applies a single dask processing option. This specifies that
all dask processing in Iris should be run in serial (that is, without any
parallel processing enabled).

The dask processing option applied by Iris can be overridden by manually setting
dask processing options for either or both of:

* the number of parallel workers to use,
* the scheduler to use.

This must be done **before** importing Iris. For example, to specify that dask
processing within Iris should use four workers in a thread pool::

>>> from multiprocessing.pool import ThreadPool
>>> import dask
>>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4))

>>> import iris
>>> # Iris processing here...

.. note::
These dask processing options will last for the lifetime of the Python session
and must be re-applied in each new session.

Other dask processing options are also available. See the
`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
for more information on setting dask processing options.


Further reading
---------------

This section of the Iris user guide provides a quick overview of real and lazy
data within Iris. For more details on these and related concepts,
see the whitepaper on lazy data.
