
IEP 1 #1988

Merged (4 commits) Jul 10, 2019

Conversation

@rhattersley (Member) commented Apr 27, 2016

IEP => Iris enhancement proposal

A replacement for the content in the wiki so we can try pointing to bits of it and discussing them.

Preview the latest version at: https://github.com/rhattersley/iris/blob/iep1/docs/iris/src/IEP/IEP001.adoc

Ping @SciTools/iris-devs, @ocefpaf, @rsignell-usgs

@ocefpaf (Member) commented Apr 27, 2016

@rhattersley I am copying ioos/APIRUS#10 (comment) here

  • I prefer the pandas style vs the xarray style simply because it has been out there for a longer period of time.
  • It would be nice if the time slice could take strings and datetime objects like pandas, instead of the specialized PartialDateTime object. (Compare cells [4] and [6] of this notebook.)
  • There was a point about date slices being inclusive or not. Inclusive is non-pythonic, but makes a lot of sense when slicing dates. That is why pandas broke from the Python standard and implemented it. So I am 👍 to inclusive.
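For reference, a minimal pandas sketch of the string/datetime slicing behaviour described above (illustrative only, not Iris API; the sample data is made up):

```python
# Pandas accepts plain strings (and datetime objects) as time-slice
# bounds, and label-based date slices are inclusive at both ends.
import pandas as pd

series = pd.Series(
    range(6),
    index=pd.date_range("2016-01-01", periods=6, freq="MS"),
)

# Both bounds are inclusive: February, March and April are all returned.
subset = series["2016-02":"2016-04"]
print(subset)
```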

@rsignell-usgs

The notebook showing 60x performance of xarray over iris for a simple time slicing operation is at
https://gist.github.com/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d

@pp-mo (Member) commented Apr 28, 2016

I'm a bit disappointed that there is no support in your examples for selecting over ranges of a coordinate

  • e.g. something equivalent to "slice the cube on 'x_coord', all points between values 20 and 50"
  • or even, "slice on 'x_coord' between indices 3 and 8".

(You'll note I am careful to avoid suggesting a usable syntax!)

Is this an intentional omission?

@rhattersley (Member, Author) commented Apr 28, 2016

Firstly, thanks for taking a look!

I'm a bit disappointed that there is no support in your examples for selecting over ranges of a coordinate

I pushed up an updated version just a few minutes before your comment which starts to address this. (I'm guessing you were reviewing the previous version in which the only hint at this was the cryptic entry "Inclusive vs. exclusive" in the TODO section.)

In general, this is not a finished document hence the (non-exhaustive) "TODO" section! Contributions are very welcome!

@pp-mo (Member) commented May 5, 2016

@ocefpaf I prefer the pandas style vs the xarray style simply because it has been out there for a longer period of time.

However, it is really essential for us to support indexing by named dimension instead of just dimension order, as the xarray docs have it:

Similarly to pandas objects, xarray objects support both integer and label based lookups along each dimension. However, xarray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
[from http://xarray.pydata.org/en/stable/indexing.html]

Unless I'm out of date and this pandas limitation has changed somehow?

@ocefpaf (Member) commented May 5, 2016

@pp-mo I meant the name of the methods: iloc/loc (pandas style) vs sel/isel/loc (xarray style).

The indexing by named dimension (or extended pandas style in the document) is essential!

@pp-mo (Member) commented May 5, 2016

@ocefpaf extended pandas style

Sorry, it's quite clear; I just hadn't read it all and properly understood it!

@pp-mo (Member) commented May 5, 2016

Some random thoughts on syntax...

Using dictionary keys or named arguments to identify coordinates means we cannot use the getitem-specific from:to:by syntax.
But we can provide a simple indexable object to convert indexing syntax into a slice construction.

E.g. define a helper object "ss" so that ss[a:b:c] produces slice(a, b, c).
The "ss" object converts a single indexing element to a slice (it's just sugar).
(It's a tiny bit like the numpy ogrid and mgrid objects.)
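To make this concrete, a minimal working sketch of such a helper (the name `ss` is only illustrative):

```python
# A trivial "slice sugar" object: indexing it simply returns the key,
# so ss[a:b:c] is just a readable way to write slice(a, b, c).
class _SliceSugar:
    def __getitem__(self, key):
        return key

ss = _SliceSugar()

print(ss[2:5])  # slice(2, 5, None)
print(ss[::3])  # slice(None, None, 3)
```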

Then we can write, for instance ...
by index :

cube.isel(height=ss[:4], longitude=ss[20:40])
 ==(say)  cube[:, :4, :, 20:40]

or probably more usefully, by value :

cube.sel(longitude=ss[12.0:73.4])
 == cube.extract(Constraint(longitude=lambda x: 12.0 <= x <= 73.4))
 ==(say)  cube[:, 14:19]

I like this because it preserves a more readable form in the code, especially for items like "from:", ":to" and ":to:by" ...

But here's another possibility: we can define two different helper objects to represent selection by index and by value. If these produce distinct output types (i.e. not just slice objects), we can then have a single common selector function which distinguishes between index- and value-selecting inputs, and hence you wouldn't need separate methods for these.
In fact, you can then freely mix by-index and by-value slicings, for instance:
cube.sel(time=ii[:2], height=10.0, longitude=vv[180:], latitude=vv[35:75])
which could be equivalent to, say, cube[7, :2, 20:27, 30:]

(assuming:

  • "ii" produces an index-slice and "vv" a value-slice
  • a bare value behaves like vv[x], to select a single coordinate value)
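A hedged sketch of how such `ii`/`vv` helpers could work (all names and the dispatch behaviour are assumptions for illustration, not a settled design): they wrap their keys in distinct types, so one common selector can tell them apart.

```python
# Hypothetical "ii"/"vv" helpers: each wraps its key in a distinct type,
# so one selector method can tell index-selections from value-selections.
class ByIndex:
    def __init__(self, key):
        self.key = key

class ByValue:
    def __init__(self, key):
        self.key = key

class _Maker:
    def __init__(self, wrapper):
        self._wrapper = wrapper

    def __getitem__(self, key):
        return self._wrapper(key)

ii = _Maker(ByIndex)   # index-based: ii[:2] -> ByIndex(slice(None, 2))
vv = _Maker(ByValue)   # value-based: vv[180:] -> ByValue(slice(180, None))

def classify(selection):
    """How a common sel() method might dispatch; a bare scalar is
    treated as a single coordinate *value*."""
    if isinstance(selection, ByIndex):
        return "index", selection.key
    if isinstance(selection, ByValue):
        return "value", selection.key
    return "value", selection
```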

@pp-mo (Member) commented May 5, 2016

different helper objects

Sorry for thinking aloud...
It only just occurred to me that with the suggested special provision of indexing 'helper functions', we could make those include the coordinate identifier as well, so we wouldn't even need a selection method at all.
So instead of :
cube.sel(time=ii[:2], height=10.0, longitude=vv[180:], latitude=vv[35:75])
we could have ...
cube[ii('time')[:2], vv('height')[10.0], vv('longitude')[180:], vv('latitude')[35:75]]

Though TBH I'm not sure that is actually preferable for readability.

@rhattersley (Member, Author)

define a helper object "ss" so that ss[a:b:c] produces slice(a, b, c).

NumPy does almost exactly that with np.s_ and np.index_exp.
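For illustration, `np.s_` in action:

```python
# np.s_ converts indexing syntax into slice objects (and tuples of
# slices for multi-dimensional expressions) -- exactly the "sugar"
# suggested above.
import numpy as np

print(np.s_[2:5])          # slice(2, 5, None)
print(np.s_[1:10:2, ::3])  # (slice(1, 10, 2), slice(None, None, 3))
```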

@pp-mo (Member) commented May 6, 2016

Regarding the distinction between orthogonal and vectorised indexing, to use the terminology used by xarray, here:
I think this is a bit cryptic in numpy, whereas Iris has the key advantage of dimension identities to make usage clearer.

The natural interpretation IMHO is to treat it like this:

orthogonal, as already discussed

  • access has the style "x_dim:(x_values), y_dim:(y_values)",
  • the values are 1d with arbitrary lengths nx, ny
  • the result then has typical dims [..., ny, nx, ...]

vectorised, by contrast:

  • access has the style "(x_dim, y_dim, .., dim_N):coord_values", with N dims involved
  • the values have the shape [M, N], such that coord_values[i] == [x_val, y_val, ..., val_N]
  • the result has dims (M,)

N.B. this last also has a natural extension where the size M becomes multidimensional.
This is what numpy does with its "full arrays of indices" approach (like the last example here).

This approach is a bit like what we currently have in Iris trajectory interpolation.
It's different in that I'm assuming that we would want to specify the coordinate (aka dimension) sequence separately from the points values, so that the selection points can be just an array.
That's not an essential difference, but I think it makes sense in view of the requirement that the individual points values must have the same length (or in the extended case, shape).
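The contrast described above can be seen in plain NumPy (a small illustrative example, not Iris code):

```python
import numpy as np

data = np.arange(16).reshape(4, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

# Vectorised (NumPy's default for integer-array indexing): the index
# arrays are paired up point-by-point, giving M=2 values, shape (2,).
vectorised = data[rows, cols]          # points (0, 1) and (2, 3)

# Orthogonal: every combination of the two index arrays, giving an
# (ny, nx) = (2, 2) block; in NumPy this needs np.ix_.
orthogonal = data[np.ix_(rows, cols)]

print(vectorised)  # [ 1 11]
print(orthogonal)  # [[ 1  3]
                   #  [ 9 11]]
```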

@marqh (Member) commented May 6, 2016, on this snippet from the IEP:

cube[dict(height=2)]
cube.iloc[dict(height=2)]

Is iloc a string which could be used in a

cube.coord('iloc')

If so, what coord names are allowed that would not be able to support this?
coord.name() could return a long_name, with a space in it, for example.

Oh, I see more now: loc and iloc are special functions (catching up slowly).


So, my comments are related to the string 'height'.

@rhattersley (Member, Author):

Yes, the names are the same as for cube.coord(...).

A name that is not a valid Python identifier can be expressed with a dict literal, e.g.:

cube.iloc[{'nasty name': 12}]

This shouldn't be an issue for general purpose code, which is probably doing the equivalent of:

# Build up a dict with the desired selection criteria
criteria = {}
for thing in some_things:
    ...
    criteria[foo] = bar
result = cube.iloc[criteria]

@pelson (Member) commented Feb 1, 2017

An alternative proposal:

Pick off model levels 25, 26, 27 (value based):

cube['model_level_number': range(25, 28)]

Use a function to do filtering with (value based):

my_func = lambda cell: 25 < cell.point <= 50
cube[coord_instance: my_func]

Use a function to do filtering with (value based) for time:

time_func = lambda cell: cell.point <= datetime.datetime(2017, 2, 1)
cube['time': time_func]

Syntactically, there is no reason to add magic indexing properties - we are able to express this uniquely through square bracket form alone. The only reason that we may choose to have separate properties is to avoid complexity within __getitem__ and to provide explicitly separate interfaces (to reduce complexity of documentation/understanding).
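A rough sketch (assumed names, not Iris code) of why the square-bracket form alone can work: Python packs cube['time': time_func] into a slice object whose .start is the coordinate name and whose .stop is the selector, and __getitem__ can unpack that.

```python
# Toy stand-in for a cube, showing the dispatch inside __getitem__.
class DemoCube:
    def __init__(self, coords):
        self.coords = coords  # {coord_name: list of cell values}

    def __getitem__(self, key):
        # cube['name': selector] arrives here as slice('name', selector).
        if isinstance(key, slice) and isinstance(key.start, str):
            name, selector = key.start, key.stop
            values = self.coords[name]
            if callable(selector):  # filter-function form
                return [v for v in values if selector(v)]
            return [v for v in values if v in selector]  # e.g. a range
        raise TypeError("positional indexing not shown in this sketch")

cube = DemoCube({"model_level_number": list(range(20, 30))})
print(cube["model_level_number": range(25, 28)])             # [25, 26, 27]
print(cube["model_level_number": (lambda c: 25 < c <= 28)])  # [26, 27, 28]
```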


One problem that we are trying to work around with all of our proposals is that we don't really have an object that represents a coordinate in the context of the cube that it lives on (quite rightly, coordinates themselves do not hold a (circular) reference back to the cube). There is nothing to say we can't have such a thing though...

class CubeCoordContext(object):
    def __init__(self, cube, coord):
        self.cube = cube
        self.coord = coord

Suppose, for a second, such a thing were returned by cube['coord_name'/coord_instance].
We could add behaviour to such a concept to allow us to do coordinate level operations to the cube.

# Index based subsetting
cube['time'][0: 10]

In reality, this thing could be the object that actually does the mapping of coordinate to data dimension for the cube (so a whole heap of (currently ugly) logic for dimension mapping could be moved into this object).

>>> cube[coord_instance].dims
<the dimensions of the cube that the coordinate maps to>

(Pushing my luck) We might even proxy all attribute access to the coordinate...

cube['time'].points
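A runnable sketch of the combined idea (hypothetical code, assuming a cube that supports coord_dims, ndim and plain indexing, as Iris cubes do; the class body is invented for illustration):

```python
class CubeCoordContext:
    """Binds a cube and one of its coordinates: indexable along that
    coordinate's dimension, with attribute access proxied to the coord."""

    def __init__(self, cube, coord):
        self.cube = cube
        self.coord = coord

    def __getitem__(self, key):
        # Index the cube along the (single) dimension the coord maps to.
        dim, = self.cube.coord_dims(self.coord)
        index = [slice(None)] * self.cube.ndim
        index[dim] = key
        return self.cube[tuple(index)]

    def __getattr__(self, name):
        # e.g. cube['time'].points falls through to the coordinate.
        return getattr(self.coord, name)
```

So cube['time'][0:10] would build a full-cube index with the slice placed on the time dimension, and cube['time'].points would proxy straight to the coordinate.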

@rhattersley (Member, Author)

Syntactically, there is no reason to add magic indexing properties - we are able to express this uniquely through square bracket form alone.

If I've read your suggestion right, you're proposing to only support two of the four indexing variants:

  1. implicit dimension with indexing by position,
  2. and named coordinate with indexing by value.

In which case, yes, you can get away without needing indexing helper objects such as loc or iloc.

@pelson (Member) commented Feb 1, 2017

If I've read your suggestion right, you're proposing to only support two of the four indexing variants

Within __getitem__ on the cube ([...]) that is my suggestion. For index-based access to named coordinates (e.g. get the first time) my proposal adds a Cube+Coordinate object that can be indexed, cube['time'][0: 10]; the result would be a cube (not a coordinate) that represents the first 10 time values.

The object proposed would be very similar to the iloc object that would be necessary in the alternative proposal, except this version would be bound to a single coordinate (meaning you can only do coordinate-based indexing one coordinate at a time). This has the advantage of allowing other cube+coordinate behaviour, such as being the canonical place to map dimensions.

@pp-mo (Member) commented Feb 8, 2017

Not sure quite how relevant this is to the true thrust of this discussion, but ...

I played around with Python access syntax with special methods
-- i.e. __getitem__, __call__, slice syntax and comparison operators --
to see what I could achieve to deliver some neater and more usable cube indexing strategies.
The result is a working solution that uses the existing constraint mechanism (but could easily be refactored).
See : pp-mo#40

Headline : this enables expressions like ...

  • one-sided value range : cube.sel(longitude=CC < 47.5)
  • two-sided value range : cube.sel(latitude=CC[-15:45])
  • get point nearest a target value : cube.sel(longitude=CC(13.7))
    • or CC.nearest(x), or CC ^ x (not sure about this one.)

I thought I was going to wind up passing a dictionary to [], but instead I found myself wanting a function interface to take coordinate names as keys -- as iris.Constraint() does. So I added a "cube.sel" function (originally a different name, but I think usage parallels xarray/pandas sel).
For this, it was then also useful to create a 'helper' object that when called or indexed produces an object representing a desired indexing (as opposed to "a part of" or "a value of" the notional object itself).

One thing that emerged from this exercise is the idea of a convenient operation to select a single point by value.
I think this is important to get around problems with inexact floating-point equality. It could be implemented differently though, perhaps with reference to bounds or point spacing -- the solution in the above PR is possibly too lax.
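As an illustration of the floating-point issue, a nearest-point lookup (a hypothetical helper, not the implementation in the PR above) avoids exact equality entirely:

```python
import numpy as np

def nearest_index(points, target):
    """Index of the coordinate point closest to `target`."""
    points = np.asarray(points, dtype=float)
    return int(np.argmin(np.abs(points - target)))

# Assumed sample grid: a 3.75-degree longitude coordinate.
longitudes = np.arange(0.0, 360.0, 3.75)

# An exact-equality test finds nothing for 13.7, but nearest-point
# selection returns the 15.0 cell (index 4).
i = nearest_index(longitudes, 13.7)
print(i, longitudes[i])  # 4 15.0
```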

@pp-mo (Member) commented Feb 8, 2017

A couple of points I'd like more clarity on:

(1)
I wouldn't defend the implementation of the existing Iris constraint mechanism, but I think the functionality may have some extra value: if I understand the pandas/xarray "sel" properly, the concept of a label is like a dimension, or "dimension coordinate" (?). It uses this exclusively as the indexing identity, whereas Iris can select by any coord.
So, does this mean that an xarray sel[from:to] is always equivalent to a slice, because the reference values are always monotonic?
From a purely practical point of view, in our experience the identity of "the" dim coord can often be arbitrary (as between 'model_level_number' / 'height' or 'time' / 'forecast_period'), and it seems very desirable to retain the freedom to select on the coord of choice (aux or dim). The selection of non-contiguous points may also have value, but I don't quite know how that works in practice.
Have I misunderstood some of this?

(2)
I don't see a clear benefit for the 'loc' and 'iloc' styles for Iris -- i.e. the 'normal' index-position identification of dimensions, but using coordinate values in place of indices.
Would it not be better to simply always use coordinates to identify indexing dimensions?
What is the downside?

Ideas @pelson @rhattersley ?

@rhattersley (Member, Author)

I don't see a clear benefit for the 'loc' and 'iloc' styles for Iris

Pragmatic brevity?

@bjlittle (Member)

I'm going to merge this PR, after it has been hanging around for over 3 years.

I'm keen to bank the content, which is just documentation, and merging is no commitment to implementing this proposal - it's just a proposal.

7 participants