WIP: Optional indexes (no more default coordinates given by range(n)) #1017

Merged
merged 8 commits into from
Dec 15, 2016
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -47,6 +47,7 @@ Attributes
Dataset.coords
Dataset.attrs
Dataset.indexes
Dataset.get_index

Dictionary interface
--------------------
@@ -196,6 +197,7 @@ Attributes
DataArray.attrs
DataArray.encoding
DataArray.indexes
DataArray.get_index

**ndarray attributes**:
:py:attr:`~DataArray.ndim`
17 changes: 11 additions & 6 deletions doc/computation.rst
@@ -196,7 +196,9 @@ This means, for example, that you always subtract an array from its transpose:
You can explicitly broadcast xarray data structures by using the
:py:func:`~xarray.broadcast` function:

a2, b2 = xr.broadcast(a, b2)
.. ipython:: python

a2, b2 = xr.broadcast(a, b)
a2
b2
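
As a rough illustration of explicit broadcasting: the arrays ``a`` and ``b`` are defined outside this hunk, so the ones below are made up for the sketch. Assuming a reasonably recent xarray, both results should end up with the same dimensions and shape:

```python
import xarray as xr

# Hypothetical inputs: `a` varies along 'x', `b` along 'y'.
a = xr.DataArray([1, 2], [("x", ["a", "b"])])
b = xr.DataArray([-1, -2, -3], [("y", [10, 20, 30])])

# broadcast() expands each argument against the dimensions of the others.
a2, b2 = xr.broadcast(a, b)

print(a2.dims, a2.shape)
print(b2.dims, b2.shape)
```

The original arrays are left untouched; only the returned objects carry the expanded dimensions.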

@@ -215,15 +217,18 @@ operations. The default result of a binary operation is by the *intersection*

.. ipython:: python

arr + arr[:1]
arr = xr.DataArray(np.arange(3), [('x', range(3))])
arr + arr[:-1]

If the result would be empty, an error is raised instead:
If coordinate values for a dimension are missing on either argument, all
matching dimensions must have the same size:

.. ipython::
.. ipython:: python

@verbatim
In [1]: arr[:2] + arr[2:]
ValueError: no overlapping labels for some dimensions: ['x']
In [1]: arr + xr.DataArray([1, 2], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension size(s) {2} than the size of the aligned dimension labels: 3
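
The two alignment cases above can be sketched side by side. This is a minimal example assuming a recent xarray release; the exact wording of the error message differs between versions, so only the exception type is checked here:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(3), [("x", range(3))])

# Both operands carry labels along 'x', so the result is restricted to
# the labels they share (the "inner" join).
inner = arr + arr[:-1]
print(inner["x"].values)

# This operand has no labels along 'x', so it can only be combined with
# `arr` if the dimension sizes match. Here they do not (2 vs 3).
unlabeled = xr.DataArray([1, 2], dims="x")
failed = False
try:
    arr + unlabeled
except ValueError as err:
    failed = True
    print("alignment failed:", err)
```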


However, one can explicitly change this default automatic alignment type ("inner")
via :py:func:`~xarray.set_options()` in context manager:
55 changes: 35 additions & 20 deletions doc/data-structures.rst
@@ -67,18 +67,33 @@ in with default values:

xr.DataArray(data)

As you can see, dimensions and coordinate arrays corresponding to each
dimension are always present. This behavior is similar to pandas, which fills
in index values in the same way.
As you can see, dimension names are always present in the xarray data model: if
you do not provide them, defaults of the form ``dim_N`` will be created.

.. note::

Prior to xarray v0.9, coordinates corresponding to dimensions were *also*
always present in xarray: xarray would create default coordinates of the form
``range(dim_size)`` if coordinates were not supplied explicitly. This is no
longer the case.

Coordinates can take the following forms:

- A list of ``(dim, ticks[, attrs])`` pairs with length equal to the number of dimensions
- A dictionary of ``{coord_name: coord}`` where the values are each a scalar value,
a 1D array or a tuple. Tuples must be in the same form as the above, and
multiple dimensions can be supplied with the form ``(dims, data[, attrs])``.
Supplying as a tuple allows other coordinates than those corresponding to
dimensions (more on these later).
- A list of values with length equal to the number of dimensions, providing
coordinate labels for each dimension. Each value must be of one of the
following forms:

* A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`
* A tuple of the form ``(dims, data[, attrs])``, which is converted into
arguments for :py:class:`~xarray.Variable`
* A pandas object or scalar value, which is converted into a ``DataArray``
* A 1D array or list, which is interpreted as values for a one dimensional
coordinate variable along the same dimension as its name

- A dictionary of ``{coord_name: coord}`` where values are of the same form
as the list. Supplying coordinates as a dictionary allows other coordinates
than those corresponding to dimensions (more on these later). If you supply
``coords`` as a dictionary, you must explicitly provide ``dims``.
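
The list and dictionary forms described above can be compared with a small sketch. The names ``values``, ``from_list``, ``from_dict`` and the ``label`` coordinate are illustrative only, and the behavior shown assumes xarray v0.9 or later:

```python
import numpy as np
import xarray as xr

values = np.zeros((2, 3))

# List form: one entry per dimension, here given as (dim, labels) tuples.
from_list = xr.DataArray(values, coords=[("x", ["a", "b"]), ("y", [-2, 0, 2])])

# Dictionary form: ``dims`` must be given explicitly. 'label' is an extra
# coordinate along 'x' that does not correspond to a dimension name, and
# 'y' is deliberately left without coordinate labels.
from_dict = xr.DataArray(
    values,
    coords={"x": ["a", "b"], "label": ("x", [10, 20])},
    dims=("x", "y"),
)

print(from_list.coords)
print(from_dict.coords)
```

In the dictionary form no default ``range(n)`` coordinate is created for ``y``, which is the behavior change this pull request documents.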

As a list of tuples:

@@ -128,7 +143,7 @@ Let's take a look at the important properties on our array:
foo.attrs
print(foo.name)

You can even modify ``values`` inplace:
You can modify ``values`` inplace:

.. ipython:: python

@@ -228,15 +243,19 @@ Creating a Dataset
To make an :py:class:`~xarray.Dataset` from scratch, supply dictionaries for any
variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).

``data_vars`` are supplied as a dictionary with each key as the name of the variable and each
- ``data_vars`` should be a dictionary with each key as the name of the variable and each
value as one of:

- A :py:class:`~xarray.DataArray`
- A tuple of the form ``(dims, data[, attrs])``
- A pandas object
* A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`
* A tuple of the form ``(dims, data[, attrs])``, which is converted into
arguments for :py:class:`~xarray.Variable`
* A pandas object, which is converted into a ``DataArray``
* A 1D array or list, which is interpreted as values for a one dimensional
coordinate variable along the same dimension as its name

- ``coords`` should be a dictionary of the same form as ``data_vars``.

``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,
arrays or tuples in the form of ``(dims, data[, attrs])``.
- ``attrs`` should be a dictionary.
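
A minimal sketch of the three arguments, using made-up variable names (``temperature``, ``scale``) and assuming a recent xarray:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        # (dims, data) tuple, converted into a Variable
        "temperature": (("x", "y"), np.zeros((2, 3))),
        # scalar value, stored as a zero-dimensional variable
        "scale": 2.5,
    },
    # A plain list is interpreted as a 1D coordinate along the
    # dimension with the same name as its key.
    coords={"x": ["a", "b"]},
    attrs={"title": "example"},
)

print(ds)
```

Only ``x`` receives coordinate labels here; ``y`` remains a dimension without a coordinate.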

Let's create some fake data for the example we show above:

@@ -257,10 +276,6 @@ Let's create some fake data for the example we show above:
'reference_time': pd.Timestamp('2014-09-05')})
ds

Notice that we did not explicitly include coordinates for the "x" or "y"
dimensions, so they were filled in array of ascending integers of the proper
length.

Here we pass :py:class:`xarray.DataArray` objects or a pandas object as values
in the dictionary:

50 changes: 37 additions & 13 deletions doc/examples/quick-overview.rst
@@ -23,7 +23,7 @@ array or list, with optional *dimensions* and *coordinates*:
.. ipython:: python

xr.DataArray(np.random.randn(2, 3))
data = xr.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))
data

If you supply a pandas :py:class:`~pandas.Series` or
@@ -121,31 +121,55 @@ xarray supports grouped operations using a very similar API to pandas:
data.groupby(labels).mean('y')
data.groupby(labels).apply(lambda x: x - x.min())

Convert to pandas
-----------------
pandas
------

A key feature of xarray is robust conversion to and from pandas objects:
Xarray objects can be easily converted to and from pandas objects:

.. ipython:: python

data.to_series()
data.to_pandas()
series = data.to_series()
series

Datasets and NetCDF
-------------------
# convert back
series.to_xarray()
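
The round trip can be sketched on its own. This example supplies labels for both dimensions and a name for the array, which is an assumption made here to keep the conversion unambiguous; it is not the ``data`` object from the overview:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"], "y": [-2, 0, 2]},
    dims=("x", "y"),
    name="foo",
)

# to_series() flattens the array into a pandas Series with a MultiIndex
# built from the 'x' and 'y' labels.
series = data.to_series()

# Series.to_xarray() unstacks the MultiIndex levels back into dimensions.
roundtrip = series.to_xarray()

print(series.index.names)
print(roundtrip.dims)
```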

:py:class:`xarray.Dataset` is a dict-like container of ``DataArray`` objects that share
index labels and dimensions. It looks a lot like a netCDF file:
Datasets
--------

:py:class:`xarray.Dataset` is a dict-like container of aligned ``DataArray``
objects. You can think of it as a multi-dimensional generalization of the
:py:class:`pandas.DataFrame`:

.. ipython:: python

ds = data.to_dataset(name='foo')
ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})
ds

Use dictionary indexing to pull out ``Dataset`` variables as ``DataArray``
objects:

.. ipython:: python

ds['foo']

Variables in datasets can have different ``dtype`` and even different
dimensions, but all dimensions are assumed to refer to points in the same shared
coordinate system.

You can do almost everything you can do with ``DataArray`` objects with
``Dataset`` objects if you prefer to work with multiple variables at once.
``Dataset`` objects (including indexing and arithmetic) if you prefer to work
with multiple variables at once.

NetCDF
------

NetCDF is the recommended binary serialization format for xarray objects. Users
from the geosciences will recognize that the :py:class:`~xarray.Dataset` data
model looks very similar to a netCDF file (which, in fact, inspired it).

Datasets also let you easily read and write netCDF files:
You can directly read and write xarray objects to disk using :py:meth:`~xarray.Dataset.to_netcdf`, :py:func:`~xarray.open_dataset` and
:py:func:`~xarray.open_dataarray`:

.. ipython:: python

34 changes: 33 additions & 1 deletion doc/indexing.rst
@@ -221,7 +221,7 @@ enabling nearest neighbor (inexact) lookups by use of the methods ``'pad'``,

.. ipython:: python

data = xr.DataArray([1, 2, 3], dims='x')
data = xr.DataArray([1, 2, 3], [('x', [0, 1, 2])])
data.sel(x=[1.1, 1.9], method='nearest')
data.sel(x=0.1, method='backfill')
data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
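
For reference, a self-contained version of these inexact lookups with the values they select. Explicit labels on ``x`` are needed because nearest-neighbor matching compares against coordinate values; the variable names ``nearest``, ``backfilled`` and ``padded`` are illustrative:

```python
import xarray as xr

data = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])])

# 1.1 is closest to label 1 and 1.9 is closest to label 2.
nearest = data.sel(x=[1.1, 1.9], method="nearest")

# 'backfill' picks the next label at or after 0.1, which is 1.
backfilled = data.sel(x=0.1, method="backfill")

# 'pad' propagates the last label at or before each requested value.
padded = data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method="pad")

print(nearest.values, backfilled.item(), padded.values)
```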
@@ -478,6 +478,30 @@ Both ``reindex_like`` and ``align`` work interchangeably between
# this is a no-op, because there are no shared dimension names
ds.reindex_like(other)

.. _indexing.missing_coordinates:

Missing coordinate labels
-------------------------

Coordinate labels for each dimension are optional (as of xarray v0.9). Label
based indexing with ``.sel`` and ``.loc`` uses standard positional,
integer-based indexing as a fallback for dimensions without a coordinate label:

.. ipython:: python

array = xr.DataArray([1, 2, 3], dims='x')
array.sel(x=[0, -1])

Alignment between xarray objects where one or both do not have coordinate labels
succeeds only if all dimensions of the same name have the same length.
Otherwise, it raises an informative error:

.. ipython::
:verbatim:

In [62]: xr.align(array, array[:2])
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {2, 3}
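
Both behaviors can be checked in one short sketch. It assumes a recent xarray; the text of the error message has changed across releases, so only the exception type is relied on:

```python
import xarray as xr

array = xr.DataArray([1, 2, 3], dims="x")

# No labels along 'x', so .sel falls back to positional indexing:
# position 0 and position -1 (the last element).
picked = array.sel(x=[0, -1])
print(picked.values)

# Aligning unlabeled dimensions of different sizes is an error.
align_failed = False
try:
    xr.align(array, array[:2])
except ValueError as err:
    align_failed = True
    print("cannot align:", err)
```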

Underlying Indexes
------------------

@@ -491,3 +515,11 @@ through the :py:attr:`~xarray.DataArray.indexes` attribute.
arr.indexes
arr.indexes['time']

Use :py:meth:`~xarray.DataArray.get_index` to get an index for a dimension,
falling back to a default :py:class:`pandas.RangeIndex` if it has no coordinate
labels:

.. ipython:: python

array
array.get_index('x')
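
A small sketch of the difference between ``indexes`` and ``get_index`` for a dimension without labels, assuming a recent xarray and pandas:

```python
import pandas as pd
import xarray as xr

array = xr.DataArray([1, 2, 3], dims="x")

# 'x' has no coordinate labels, so it does not appear in .indexes ...
has_index = "x" in array.indexes

# ... but get_index() still returns a default integer index for it.
index = array.get_index("x")

print(has_index)
print(index)
```

Code that needs an index for every dimension can call ``get_index`` rather than looking dimensions up in ``indexes``.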
33 changes: 31 additions & 2 deletions doc/whats-new.rst
@@ -21,10 +21,32 @@ v0.9.0 (unreleased)
Breaking changes
~~~~~~~~~~~~~~~~

- Index coordinates for each dimension are now optional, and no longer created
by default :issue:`1017`. This has a number of implications:

- :py:func:`~align` and :py:meth:`~Dataset.reindex` can now raise an error if
dimension labels are missing and dimensions have different sizes.
- Because pandas does not support missing indexes, methods such as
``to_dataframe``/``from_dataframe`` and ``stack``/``unstack`` no longer
roundtrip faithfully on all inputs. Use :py:meth:`~Dataset.reset_index` to
remove undesired indexes.
- ``Dataset.__delitem__`` and :py:meth:`~Dataset.drop` no longer delete/drop
variables that have dimensions matching a deleted/dropped variable.
- ``DataArray.coords.__delitem__`` is now allowed on variables matching
dimension names.
- ``.sel`` and ``.loc`` now handle indexing along a dimension without
coordinate labels by doing integer based indexing. See
:ref:`indexing.missing_coordinates` for an example.
- :py:attr:`~Dataset.indexes` is no longer guaranteed to include all
dimension names as keys. The new method :py:meth:`~Dataset.get_index` is
guaranteed to return an index for a dimension, falling back to a default
``RangeIndex`` if necessary.

- The default behavior of ``merge`` is now ``compat='no_conflicts'``, so some
merges will now succeed in cases that previously raised
``xarray.MergeError``. Set ``compat='broadcast_equals'`` to restore the
previous default.
previous default. See :ref:`combining.no_conflicts` for more details.

- Reading :py:attr:`~DataArray.values` no longer always caches values in a NumPy
array :issue:`1128`. Caching of ``.values`` on variables read from netCDF
files on disk is still the default when :py:func:`open_dataset` is called with
@@ -150,6 +172,13 @@ Bug fixes
should be computed or not.
By `Fabien Maussion <https://github.com/fmaussion>`_.

- Grouping over a dimension with non-unique values with ``groupby`` gives
correct groups.
By `Stephan Hoyer <https://github.com/shoyer>`_.

- Fixed accessing coordinate variables with non-string names from ``.coords``.
By `Stephan Hoyer <https://github.com/shoyer>`_.

- :py:meth:`~xarray.DataArray.rename` now simultaneously renames the array and
any coordinate with the same name, when supplied via a :py:class:`dict`
(:issue:`1116`).
@@ -1280,7 +1309,7 @@ Enhancements

.. ipython:: python

data = xray.DataArray([1, 2, 3], dims='x')
data = xray.DataArray([1, 2, 3], [('x', range(3))])
data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')

This will be especially useful once pandas 0.16 is released, at which point
25 changes: 0 additions & 25 deletions xarray/backends/common.py
@@ -33,25 +33,6 @@ def _decode_variable_name(name):
return name


def is_trivial_index(var):
"""
Determines if an index is 'trivial', meaning that it is
equivalent to np.arange(). This is determined by
checking if there are any attributes or encodings,
if ndims is one, dtype is int and finally by comparing
the actual values to np.arange()
"""
# if either attributes or encodings are defined
# the index is not trivial.
if len(var.attrs) or len(var.encoding):
return False
# if the index is not a 1d integer array
if var.ndim > 1 or not var.dtype.kind == 'i':
return False
arange = np.arange(var.size, dtype=var.dtype)
return np.all(var.values == arange)


def robust_getitem(array, key, catch=Exception, max_retries=6,
initial_delay=500):
"""
@@ -203,12 +184,6 @@ def store_dataset(self, dataset):

def store(self, variables, attributes, check_encoding_set=frozenset()):
self.set_attributes(attributes)
neccesary_dims = [v.dims for v in variables.values()]
neccesary_dims = set(itertools.chain(*neccesary_dims))
# set all non-indexes and any index which is not trivial.
variables = OrderedDict((k, v) for k, v in iteritems(variables)
if not (k in neccesary_dims and
is_trivial_index(v)))
self.set_variables(variables, check_encoding_set)

def set_attributes(self, attributes):
4 changes: 2 additions & 2 deletions xarray/conventions.py
@@ -913,7 +913,7 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
identify coordinates.
drop_variables: string or iterable, optional
A variable or list of variables to exclude from being parsed from the
dataset.This may be useful to drop variables with problems or
dataset. This may be useful to drop variables with problems or
inconsistent values.

Returns
@@ -939,7 +939,7 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
vars, attrs, concat_characters, mask_and_scale, decode_times,
decode_coords, drop_variables=drop_variables)
ds = Dataset(vars, attrs=attrs)
ds = ds.set_coords(coord_names.union(extra_coords))
ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars))
ds._file_obj = file_obj
return ds
