forked from pydata/xarray

Merge remote-tracking branch 'upstream/master' into fix/user-coordinates
* upstream/master:
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  Dataset.map, GroupBy.map, Resample.map (pydata#3459)
  tests for datasets with units (pydata#3447)
  fix pandas-dev tests (pydata#3491)
  unpin pseudonetcdf (pydata#3496)
  whatsnew corrections (pydata#3494)
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
dcherian committed Nov 12, 2019
2 parents 2279c36 + 4e9240a commit d49ceef
Showing 34 changed files with 2,569 additions and 356 deletions.
2 changes: 1 addition & 1 deletion ci/azure/install.yml
@@ -16,7 +16,7 @@ steps:
--pre \
--upgrade \
matplotlib \
- pandas=0.26.0.dev0+628.g03c1a3db2 \ # FIXME https://github.com/pydata/xarray/issues/3440
+ pandas \
scipy
# numpy \ # FIXME https://github.com/pydata/xarray/issues/3409
pip install \
2 changes: 1 addition & 1 deletion ci/requirements/py36.yml
@@ -29,7 +29,7 @@ dependencies:
- pandas
- pint
- pip
- - pseudonetcdf<3.1 # FIXME https://github.com/pydata/xarray/issues/3409
+ - pseudonetcdf
- pydap
- pynio
- pytest
2 changes: 1 addition & 1 deletion ci/requirements/py37-windows.yml
@@ -29,7 +29,7 @@ dependencies:
- pandas
- pint
- pip
- - pseudonetcdf<3.1 # FIXME https://github.com/pydata/xarray/issues/3409
+ - pseudonetcdf
- pydap
# - pynio # Not available on Windows
- pytest
2 changes: 1 addition & 1 deletion ci/requirements/py37.yml
@@ -29,7 +29,7 @@ dependencies:
- pandas
- pint
- pip
- - pseudonetcdf<3.1 # FIXME https://github.com/pydata/xarray/issues/3409
+ - pseudonetcdf
- pydap
- pynio
- pytest
14 changes: 8 additions & 6 deletions doc/api.rst
@@ -94,7 +94,7 @@ Dataset contents
Dataset.rename_dims
Dataset.swap_dims
Dataset.expand_dims
- Dataset.drop
+ Dataset.drop_vars
Dataset.drop_dims
Dataset.set_coords
Dataset.reset_coords
@@ -118,6 +118,7 @@ Indexing
Dataset.loc
Dataset.isel
Dataset.sel
+ Dataset.drop_sel
Dataset.head
Dataset.tail
Dataset.thin
@@ -154,7 +155,7 @@ Computation
.. autosummary::
:toctree: generated/

- Dataset.apply
+ Dataset.map
Dataset.reduce
Dataset.groupby
Dataset.groupby_bins
@@ -263,7 +264,7 @@ DataArray contents
DataArray.rename
DataArray.swap_dims
DataArray.expand_dims
- DataArray.drop
+ DataArray.drop_vars
DataArray.reset_coords
DataArray.copy

@@ -283,6 +284,7 @@ Indexing
DataArray.loc
DataArray.isel
DataArray.sel
+ DataArray.drop_sel
DataArray.head
DataArray.tail
DataArray.thin
@@ -542,10 +544,10 @@ GroupBy objects
:toctree: generated/

core.groupby.DataArrayGroupBy
- core.groupby.DataArrayGroupBy.apply
+ core.groupby.DataArrayGroupBy.map
core.groupby.DataArrayGroupBy.reduce
core.groupby.DatasetGroupBy
- core.groupby.DatasetGroupBy.apply
+ core.groupby.DatasetGroupBy.map
core.groupby.DatasetGroupBy.reduce

Rolling objects
Expand All @@ -566,7 +568,7 @@ Resample objects
================

Resample objects also implement the GroupBy interface
- (methods like ``apply()``, ``reduce()``, ``mean()``, ``sum()``, etc.).
+ (methods like ``map()``, ``reduce()``, ``mean()``, ``sum()``, etc.).

.. autosummary::
:toctree: generated/
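
For context on the Resample note above: since Resample objects share the GroupBy interface, the renamed ``map`` works there too. A minimal sketch, not part of this commit; the data and frequency are invented:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series, purely for illustration
times = pd.date_range("2019-01-01", periods=365)
da = xr.DataArray(np.arange(365.0), coords={"time": times}, dims="time")

# Resample implements the GroupBy interface, so the renamed map()
# replaces the deprecated apply() here as well
monthly_anomaly = da.resample(time="1M").map(lambda x: x - x.mean())
```
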
6 changes: 3 additions & 3 deletions doc/computation.rst
@@ -183,7 +183,7 @@ a value when aggregating:

Note that rolling window aggregations are faster and use less memory when bottleneck_ is installed. This only applies to numpy-backed xarray objects.

- .. _bottleneck: https://github.com/kwgoodman/bottleneck/
+ .. _bottleneck: https://github.com/pydata/bottleneck/

We can also manually iterate through ``Rolling`` objects:

@@ -462,13 +462,13 @@ Datasets support most of the same methods found on data arrays:
abs(ds)
Datasets also support NumPy ufuncs (requires NumPy v1.13 or newer), or
- alternatively you can use :py:meth:`~xarray.Dataset.apply` to apply a function
+ alternatively you can use :py:meth:`~xarray.Dataset.map` to map a function
to each variable in a dataset:

.. ipython:: python
np.sin(ds)
- ds.apply(np.sin)
+ ds.map(np.sin)
Datasets also use looping over variables for *broadcasting* in binary
arithmetic. You can do arithmetic between any ``DataArray`` and a dataset:
2 changes: 1 addition & 1 deletion doc/dask.rst
@@ -292,7 +292,7 @@ For the best performance when using Dask's multi-threaded scheduler, wrap a
function that already releases the global interpreter lock, which fortunately
already includes most NumPy and Scipy functions. Here we show an example
using NumPy operations and a fast function from
- `bottleneck <https://github.com/kwgoodman/bottleneck>`__, which
+ `bottleneck <https://github.com/pydata/bottleneck>`__, which
we use to calculate `Spearman's rank-correlation coefficient <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`__:

.. code-block:: python
4 changes: 2 additions & 2 deletions doc/data-structures.rst
@@ -393,14 +393,14 @@ methods (like pandas) for transforming datasets into new objects.

For removing variables, you can select and drop an explicit list of
variables by indexing with a list of names or using the
- :py:meth:`~xarray.Dataset.drop` methods to return a new ``Dataset``. These
+ :py:meth:`~xarray.Dataset.drop_vars` methods to return a new ``Dataset``. These
operations keep around coordinates:

.. ipython:: python
ds[['temperature']]
ds[['temperature', 'temperature_double']]
- ds.drop('temperature')
+ ds.drop_vars('temperature')
To remove a dimension, you can use :py:meth:`~xarray.Dataset.drop_dims` method.
Any variables using that dimension are dropped:
15 changes: 8 additions & 7 deletions doc/groupby.rst
@@ -35,10 +35,11 @@ Let's create a simple example dataset:
.. ipython:: python
- ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
-                 coords={'x': [10, 20, 30, 40],
-                         'letters': ('x', list('abba'))})
- arr = ds['foo']
+ ds = xr.Dataset(
+     {"foo": (("x", "y"), np.random.rand(4, 3))},
+     coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
+ )
+ arr = ds["foo"]
ds
If we groupby the name of a variable or coordinate in a dataset (we can also
@@ -93,15 +94,15 @@ Apply
~~~~~

To apply a function to each group, you can use the flexible
- :py:meth:`~xarray.DatasetGroupBy.apply` method. The resulting objects are automatically
+ :py:meth:`~xarray.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python
def standardize(x):
return (x - x.mean()) / x.std()
- arr.groupby('letters').apply(standardize)
+ arr.groupby('letters').map(standardize)
GroupBy objects also have a :py:meth:`~xarray.DatasetGroupBy.reduce` method and
methods like :py:meth:`~xarray.DatasetGroupBy.mean` as shortcuts for applying an
Expand Down Expand Up @@ -202,7 +203,7 @@ __ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimen
dims=['ny','nx'])
da
da.groupby('lon').sum(...)
da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
- da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
+ da.groupby('lon').map(lambda x: x - x.mean(), shortcut=False)
Because multidimensional groups have the ability to generate a very large
number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
2 changes: 1 addition & 1 deletion doc/howdoi.rst
@@ -44,7 +44,7 @@ How do I ...
* - convert a possibly irregularly sampled timeseries to a regularly sampled timeseries
- :py:meth:`DataArray.resample`, :py:meth:`Dataset.resample` (see :ref:`resampling` for more)
* - apply a function on all data variables in a Dataset
- - :py:meth:`Dataset.apply`
+ - :py:meth:`Dataset.map`
* - write xarray objects with complex values to a netCDF file
- :py:func:`Dataset.to_netcdf`, :py:func:`DataArray.to_netcdf` specifying ``engine="h5netcdf", invalid_netcdf=True``
* - make xarray objects look like other xarray objects
6 changes: 3 additions & 3 deletions doc/indexing.rst
@@ -232,14 +232,14 @@ Using indexing to *assign* values to a subset of dataset (e.g.,
Dropping labels and dimensions
------------------------------

- The :py:meth:`~xarray.Dataset.drop` method returns a new object with the listed
+ The :py:meth:`~xarray.Dataset.drop_sel` method returns a new object with the listed
index labels along a dimension dropped:

.. ipython:: python
- ds.drop(space=['IN', 'IL'])
+ ds.drop_sel(space=['IN', 'IL'])
- ``drop`` is both a ``Dataset`` and ``DataArray`` method.
+ ``drop_sel`` is both a ``Dataset`` and ``DataArray`` method.

Use :py:meth:`~xarray.Dataset.drop_dims` to drop a full dimension from a Dataset.
Any variables with these dimensions are also dropped:
2 changes: 1 addition & 1 deletion doc/installing.rst
@@ -43,7 +43,7 @@ For accelerating xarray

- `scipy <http://scipy.org/>`__: necessary to enable the interpolation features for
xarray objects
- - `bottleneck <https://github.com/kwgoodman/bottleneck>`__: speeds up
+ - `bottleneck <https://github.com/pydata/bottleneck>`__: speeds up
NaN-skipping and rolling window aggregations by a large factor
- `numbagg <https://github.com/shoyer/numbagg>`_: for exponential rolling
window operations
2 changes: 1 addition & 1 deletion doc/quick-overview.rst
@@ -142,7 +142,7 @@ xarray supports grouped operations using a very similar API to pandas (see :ref:
labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
labels
data.groupby(labels).mean('y')
- data.groupby(labels).apply(lambda x: x - x.min())
+ data.groupby(labels).map(lambda x: x - x.min())
Plotting
--------
24 changes: 22 additions & 2 deletions doc/whats-new.rst
@@ -38,6 +38,19 @@ Breaking changes

New Features
~~~~~~~~~~~~
+ - :py:meth:`Dataset.drop_sel` & :py:meth:`DataArray.drop_sel` have been added for dropping labels.
+   :py:meth:`Dataset.drop_vars` & :py:meth:`DataArray.drop_vars` have been added for
+   dropping variables (including coordinates). The existing ``drop`` methods remain as a backward compatible
+   option for dropping either labels or variables, but using the more specific methods is encouraged.
+   (:pull:`3475`)
+   By `Maximilian Roos <https://github.com/max-sixty>`_
+ - :py:meth:`Dataset.map` & :py:meth:`GroupBy.map` & :py:meth:`Resample.map` have been added for
+   mapping / applying a function over each item in the collection, reflecting the widely used
+   and least surprising name for this operation.
+   The existing ``apply`` methods remain for backward compatibility, though using the ``map``
+   methods is encouraged.
+   (:pull:`3459`)
+   By `Maximilian Roos <https://github.com/max-sixty>`_
- :py:meth:`Dataset.transpose` and :py:meth:`DataArray.transpose` now support an ellipsis (`...`)
to represent all 'other' dimensions. For example, to move one dimension to the front,
use ``.transpose('x', ...)``. (:pull:`3421`)
@@ -74,6 +87,10 @@ Bug fixes
- Fix grouping over variables with NaNs. (:issue:`2383`, :pull:`3406`).
By `Deepak Cherian <https://github.com/dcherian>`_.
- Sync with cftime by removing ``dayofwk=-1`` for cftime>=1.0.4.
+ - Use dask names to compare dask objects prior to comparing values after computation.
+   (:issue:`3068`, :issue:`3311`, :issue:`3454`, :pull:`3453`).
+   By `Deepak Cherian <https://github.com/dcherian>`_.
- Sync with cftime by removing `dayofwk=-1` for cftime>=1.0.4.
By `Anderson Banihirwe <https://github.com/andersy005>`_.
- Fix :py:meth:`xarray.core.groupby.DataArrayGroupBy.reduce` and
:py:meth:`xarray.core.groupby.DatasetGroupBy.reduce` when reducing over multiple dimensions.
@@ -98,7 +115,7 @@ Internal Changes
~~~~~~~~~~~~~~~~

- Added integration tests against `pint <https://pint.readthedocs.io/>`_.
-   (:pull:`3238`) by `Justus Magin <https://github.com/keewis>`_.
+   (:pull:`3238`, :pull:`3447`, :pull:`3508`) by `Justus Magin <https://github.com/keewis>`_.

.. note::

@@ -114,6 +131,8 @@ Internal Changes
- Run basic CI tests on Python 3.8. (:pull:`3477`)
By `Maximilian Roos <https://github.com/max-sixty>`_

+ - Enable type checking on default sentinel values (:pull:`3472`)
+   By `Maximilian Roos <https://github.com/max-sixty>`_

.. _whats-new.0.14.0:

@@ -3721,7 +3740,7 @@ Breaking changes
warnings: methods and attributes that were deprecated in xray v0.3 or earlier
(e.g., ``dimensions``, ``attributes```) have gone away.

- .. _bottleneck: https://github.com/kwgoodman/bottleneck
+ .. _bottleneck: https://github.com/pydata/bottleneck

Enhancements
~~~~~~~~~~~~
@@ -3752,6 +3771,7 @@ Enhancements
explicitly listed variables or index labels:

.. ipython:: python
+ :okwarning:
# drop variables
ds = xray.Dataset({'x': 0, 'y': 1})
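
Taken together, the two whats-new entries above amount to the following usage. A minimal sketch assuming xarray >= 0.14.1; the dataset here is invented:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 3))},
    coords={"x": [10, 20, 30, 40]},
)

ds.drop_vars("foo")      # drop a variable; previously ds.drop("foo")
ds.drop_sel(x=[10, 20])  # drop index labels; previously ds.drop(x=[10, 20])
ds.map(np.sin)           # map a function over each data variable;
                         # previously ds.apply(np.sin)
```
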
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,7 +1,7 @@
[tool:pytest]
python_files=test_*.py
testpaths=xarray/tests properties
- # Fixed upstream in https://github.com/kwgoodman/bottleneck/pull/199
+ # Fixed upstream in https://github.com/pydata/bottleneck/pull/199
filterwarnings =
ignore:Using a non-tuple sequence for multidimensional indexing is deprecated:FutureWarning
env =
58 changes: 38 additions & 20 deletions xarray/core/concat.py
@@ -2,6 +2,7 @@

from . import dtypes, utils
from .alignment import align
+ from .duck_array_ops import lazy_array_equiv
from .merge import _VALID_COMPAT, unique_variable
from .variable import IndexVariable, Variable, as_variable
from .variable import concat as concat_vars
@@ -189,26 +190,43 @@ def process_subset_opt(opt, subset):
# all nonindexes that are not the same in each dataset
for k in getattr(datasets[0], subset):
if k not in concat_over:
- # Compare the variable of all datasets vs. the one
- # of the first dataset. Perform the minimum amount of
- # loads in order to avoid multiple loads from disk
- # while keeping the RAM footprint low.
- v_lhs = datasets[0].variables[k].load()
- # We'll need to know later on if variables are equal.
- computed = []
- for ds_rhs in datasets[1:]:
-     v_rhs = ds_rhs.variables[k].compute()
-     computed.append(v_rhs)
-     if not getattr(v_lhs, compat)(v_rhs):
-         concat_over.add(k)
-         equals[k] = False
-         # computed variables are not to be re-computed
-         # again in the future
-         for ds, v in zip(datasets[1:], computed):
-             ds.variables[k].data = v.data
-         break
- else:
-     equals[k] = True
+ equals[k] = None
+ variables = [ds.variables[k] for ds in datasets]
+ # first check without comparing values i.e. no computes
+ for var in variables[1:]:
+     equals[k] = getattr(variables[0], compat)(
+         var, equiv=lazy_array_equiv
+     )
+     if equals[k] is not True:
+         # exit early if we know these are not equal or that
+         # equality cannot be determined i.e. one or all of
+         # the variables wraps a numpy array
+         break
+ else:
+     equals[k] = True
+
+ if equals[k] is False:
+     concat_over.add(k)
+
+ elif equals[k] is None:
+     # Compare the variable of all datasets vs. the one
+     # of the first dataset. Perform the minimum amount of
+     # loads in order to avoid multiple loads from disk
+     # while keeping the RAM footprint low.
+     v_lhs = datasets[0].variables[k].load()
+     # We'll need to know later on if variables are equal.
+     computed = []
+     for ds_rhs in datasets[1:]:
+         v_rhs = ds_rhs.variables[k].compute()
+         computed.append(v_rhs)
+         if not getattr(v_lhs, compat)(v_rhs):
+             concat_over.add(k)
+             equals[k] = False
+             # computed variables are not to be re-computed
+             # again in the future
+             for ds, v in zip(datasets[1:], computed):
+                 ds.variables[k].data = v.data
+             break
+     else:
+         equals[k] = True

elif opt == "all":
concat_over.update(
@@ -370,7 +388,7 @@ def ensure_common_dims(vars):
result = result.set_coords(coord_names)
result.encoding = result_encoding

- result = result.drop(unlabeled_dims, errors="ignore")
+ result = result.drop_vars(unlabeled_dims, errors="ignore")

if coord is not None:
# add concat dimension last to ensure that its in the final Dataset
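
The early-exit comparison above relies on ``lazy_array_equiv``, which reports ``True`` or ``False`` when equality is decidable without computing, and ``None`` when values must actually be compared. A rough sketch of that tri-state idea, not the actual xarray implementation:

```python
import dask.array as da

def lazy_equiv_sketch(arr1, arr2):
    """Decide array equality without computing, where possible.

    Returns True/False when the answer is knowable lazily, or None
    when the caller must fall back to a real (computing) comparison.
    """
    if arr1 is arr2:
        return True
    if arr1.shape != arr2.shape:
        return False
    if isinstance(arr1, da.Array) and isinstance(arr2, da.Array):
        if arr1.name == arr2.name:
            # identical task graphs -> identical values, no compute needed
            return True
    return None

x = da.ones((3, 3), chunks=2)
print(lazy_equiv_sketch(x, x))           # True: same object/graph
print(lazy_equiv_sketch(x, x + 0))       # None: different graphs, must compute
print(lazy_equiv_sketch(x, da.ones(4)))  # False: shapes differ
```
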