
IEP 1 #1988

Merged (4 commits) Jul 10, 2019

Conversation

@rhattersley (Member) commented Apr 27, 2016

IEP => Iris enhancement proposal

A replacement for the content in the wiki so we can try pointing to bits of it and discussing them.

Preview the latest version at: https://github.com/rhattersley/iris/blob/iep1/docs/iris/src/IEP/IEP001.adoc

Ping @SciTools/iris-devs, @ocefpaf, @rsignell-usgs

@ocefpaf (Member) commented Apr 27, 2016

@rhattersley I am copying ioos/APIRUS#10 (comment) here

  • I prefer the pandas style vs the xarray style simply because it has been out there for a longer period of time.
  • It would be nice if the time slice could take strings and datetime objects like pandas, instead of the specialized PartialDateTime object. (Compare cells [4] and [6] of this notebook.)
  • There was a point about date slices being inclusive or not. Inclusive is non-pythonic, but makes a lot of sense when slicing dates. That is why pandas broke from the Python standard and implemented it. So I am 👍 to inclusive.
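For reference, a minimal pandas sketch of the string/datetime slicing behaviour described above (illustrative only, not Iris API; the sample data is made up):

```python
# Pandas accepts plain strings (and datetime objects) as time-slice
# bounds, and label-based date slices are inclusive at both ends.
import pandas as pd

series = pd.Series(
    range(6),
    index=pd.date_range("2016-01-01", periods=6, freq="MS"),
)

# Both bounds are inclusive: February, March and April are all returned.
subset = series["2016-02":"2016-04"]
print(subset)
```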

@rsignell-usgs

The notebook showing 60x performance of xarray over iris for a simple time slicing operation is at
https://gist.github.com/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d

@pp-mo (Member) commented Apr 28, 2016

I'm a bit disappointed that there is no support in your examples for selecting over ranges of a coordinate

  • e.g. something equivalent to "slice the cube on 'x_coord', all points between values 20 and 50"
  • or even, "slice on 'x_coord' between indices 3 and 8".

(You'll note I am careful to avoid suggesting a usable syntax!)

Is this an intentional omission?

@rhattersley (Member, Author) commented Apr 28, 2016

Firstly, thanks for taking a look!

I'm a bit disappointed that there is no support in your examples for selecting over ranges of a coordinate

I pushed up an updated version just a few minutes before your comment which starts to address this. (I'm guessing you were reviewing the previous version in which the only hint at this was the cryptic entry "Inclusive vs. exclusive" in the TODO section.)

In general, this is not a finished document hence the (non-exhaustive) "TODO" section! Contributions are very welcome!

@pp-mo (Member) commented May 5, 2016

@ocefpaf I prefer the pandas style vs the xarray style simply because it has been out there for a longer period of time.

However, it is really essential for us to support indexing by named dimension instead of just dimension order, as the xarray docs have it:

Similarly to pandas objects, xarray objects support both integer and label based lookups along each dimension. However, xarray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
[from http://xarray.pydata.org/en/stable/indexing.html]

Unless I'm out of date and this pandas limitation has changed somehow?

@ocefpaf (Member) commented May 5, 2016

@pp-mo I meant the name of the methods: iloc/loc (pandas style) vs sel/isel/loc (xarray style).

The indexing by named dimension (or extended pandas style in the document) is essential!

@pp-mo (Member) commented May 5, 2016

@ocefpaf extended pandas style

Sorry, it's quite clear; I just hadn't read it all and properly understood it!

@pp-mo (Member) commented May 5, 2016

Some random thoughts on syntax...

Using dictionary keys or named arguments to identify coordinates means we cannot use the getitem-specific from:to:by syntax.
But we can provide a simple indexable object to convert indexing syntax into a slice construction.

E.g. define a helper object "ss" so that ss[a:b:c] produces slice(a, b, c).
The "ss" object converts a single indexing element to a slice (it's just sugar).
(It's a tiny bit like the numpy ogrid and mgrid objects.)
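To make this concrete, a minimal working sketch of such a helper (the name `ss` is only illustrative):

```python
# A trivial "slice sugar" object: indexing it simply returns the key,
# so ss[a:b:c] is just a readable way to write slice(a, b, c).
class _SliceSugar:
    def __getitem__(self, key):
        return key

ss = _SliceSugar()

print(ss[2:5])  # slice(2, 5, None)
print(ss[::3])  # slice(None, None, 3)
```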

Then we can write, for instance ...
by index :

cube.isel(height=ss[:4], longitude=ss[20:40])
 ==(say)  cube[:, :4, :, 20:40]

or probably more usefully, by value :

cube.sel(longitude=ss[12.0:73.4])
 == cube.extract(Constraint(longitude=lambda x: 12.0 <= x <= 73.4))
 ==(say)  cube[:, 14:19]

I like this because it preserves a more readable form in the code, especially for items like "from:", ":to" and ":to:by" ...

But here's another possibility: we can define two different helper objects to represent selection by index and by value. If these produce distinct output types (i.e. not just slice objects), we can then have a single common selector function which distinguishes between index- and value-selecting inputs, and hence you wouldn't need separate methods for these.
In fact, you can then freely mix by-index and by-value slicings, for instance:
cube.sel(time=ii[:2], height=10.0, longitude=vv[180:], latitude=vv[35:75])
which could be equivalent to, say, cube[7, :2, 20:27, 30:]

(assuming:

  • "ii" produces an index-slice and "vv" a value-slice
  • a bare value behaves like vv[x], to select a single coordinate value)
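A hedged sketch of how such `ii`/`vv` helpers could work (all names and the dispatch behaviour are assumptions for illustration, not a settled design): they wrap their keys in distinct types, so one common selector can tell them apart.

```python
# Hypothetical "ii"/"vv" helpers: each wraps its key in a distinct type,
# so one selector method can tell index-selections from value-selections.
class ByIndex:
    def __init__(self, key):
        self.key = key

class ByValue:
    def __init__(self, key):
        self.key = key

class _Maker:
    def __init__(self, wrapper):
        self._wrapper = wrapper

    def __getitem__(self, key):
        return self._wrapper(key)

ii = _Maker(ByIndex)   # index-based: ii[:2] -> ByIndex(slice(None, 2))
vv = _Maker(ByValue)   # value-based: vv[180:] -> ByValue(slice(180, None))

def classify(selection):
    """How a common sel() method might dispatch; a bare scalar is
    treated as a single coordinate *value*."""
    if isinstance(selection, ByIndex):
        return "index", selection.key
    if isinstance(selection, ByValue):
        return "value", selection.key
    return "value", selection
```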

@pp-mo (Member) commented May 5, 2016

different helper objects

Sorry for thinking aloud...
It only just occurred to me that with the suggested special provision of indexing 'helper functions', we could make those include the coordinate identifier as well, so we wouldn't even need a selection method at all.
So instead of :
cube.sel(time=ii[:2], height=10.0, longitude=vv[180:], latitude=vv[35:75])
we could have ...
cube[ii('time')[:2], vv('height')[10.0], vv('longitude')[180:], vv('latitude')[35:75]]

Though TBH I'm not sure that is actually preferable for readability.

@rhattersley (Member, Author)

define a helper object "ss" so that ss[a:b:c] produces slice(a, b, c).

NumPy does almost exactly that with np.s_ and np.index_exp.
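For illustration, `np.s_` in action:

```python
# np.s_ converts indexing syntax into slice objects (and tuples of
# slices for multi-dimensional expressions) -- exactly the "sugar"
# suggested above.
import numpy as np

print(np.s_[2:5])          # slice(2, 5, None)
print(np.s_[1:10:2, ::3])  # (slice(1, 10, 2), slice(None, None, 3))
```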

@pp-mo (Member) commented May 6, 2016

Regarding the distinction between orthogonal and vectorised indexing, to use the terminology used by xarray, here:
I think this is a bit cryptic in numpy, whereas Iris has the key advantage of dimension identities to make usage clearer.

The natural interpretation IMHO is to treat it like this:

orthogonal, as already discussed

  • access has the style "x_dim:(x_values), y_dim:(y_values)",
  • the values are 1d with arbitrary lengths nx, ny
  • the result then has typical dims [..., ny, nx, ...]

vectorised, by contrast:

  • access has the style "(x_dim, y_dim, .., dim_N):coord_values", with N dims involved
  • the values have the shape [M, N], such that coord_values[i] == [x_val, y_val, ..., val_N]
  • the result has dims (M,)

N.B. this last also has a natural extension where the size M becomes multidimensional.
This is what numpy does with its "full arrays of indices" approach (like the last example here).

This approach is a bit like what we currently have in Iris trajectory interpolation.
It's different in that I'm assuming that we would want to specify the coordinate (aka dimension) sequence separately from the points values, so that the selection points can be just an array.
That's not an essential difference, but I think it makes sense in view of the requirement that the individual points values must have the same length (or in the extended case, shape).
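The contrast described above can be seen in plain NumPy (a small illustrative example, not Iris code):

```python
import numpy as np

data = np.arange(16).reshape(4, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

# Vectorised (NumPy's default for integer-array indexing): the index
# arrays are paired up point-by-point, giving M=2 values, shape (2,).
vectorised = data[rows, cols]          # points (0, 1) and (2, 3)

# Orthogonal: every combination of the two index arrays, giving an
# (ny, nx) = (2, 2) block; in NumPy this needs np.ix_.
orthogonal = data[np.ix_(rows, cols)]

print(vectorised)  # [ 1 11]
print(orthogonal)  # [[ 1  3]
                   #  [ 9 11]]
```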

@marqh (Member) commented May 6, 2016, on this snippet from the IEP:

cube[dict(height=2)]
cube.iloc[dict(height=2)]

Is iloc a string which could be used in a

cube.coord('iloc')

If so, what coord names are allowed that would not be able to support this?
coord.name() could return a long_name, with a space in it, for example.

Oh, I see more now: loc and iloc are special functions (catching up slowly).


So, my comments are related to the string 'height'.

@rhattersley (Member, Author):

Yes, the names are the same as for cube.coord(...).

A name that is not a valid Python identifier can be expressed with a dict literal, e.g.:

cube.iloc[{'nasty name': 12}]

This shouldn't be an issue for general purpose code, which is probably doing the equivalent of:

# Build up a dict with the desired selection criteria
criteria = {}
for thing in some_things:
    ...
    criteria[foo] = bar
result = cube.iloc[criteria]

@pelson (Member) commented Feb 1, 2017

An alternative proposal:

Pick off model levels 25, 26, 27 (value based):

cube['model_level_number': range(25, 28)]

Use a function to do filtering with (value based):

my_func = lambda cell: 25 < cell.point <= 50
cube[coord_instance: my_func]

Use a function to do filtering with (value based) for time:

time_func = lambda cell: cell.point <= datetime.datetime(2017, 2, 1)
cube['time': time_func]

Syntactically, there is no reason to add magic indexing properties - we are able to express this uniquely through square bracket form alone. The only reason that we may choose to have separate properties is to avoid complexity within __getitem__ and to provide explicitly separate interfaces (to reduce complexity of documentation/understanding).
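A rough sketch (assumed names, not Iris code) of why the square-bracket form alone can work: Python packs cube['time': time_func] into a slice object whose .start is the coordinate name and whose .stop is the selector, and __getitem__ can unpack that.

```python
# Toy stand-in for a cube, showing the dispatch inside __getitem__.
class DemoCube:
    def __init__(self, coords):
        self.coords = coords  # {coord_name: list of cell values}

    def __getitem__(self, key):
        # cube['name': selector] arrives here as slice('name', selector).
        if isinstance(key, slice) and isinstance(key.start, str):
            name, selector = key.start, key.stop
            values = self.coords[name]
            if callable(selector):  # filter-function form
                return [v for v in values if selector(v)]
            return [v for v in values if v in selector]  # e.g. a range
        raise TypeError("positional indexing not shown in this sketch")

cube = DemoCube({"model_level_number": list(range(20, 30))})
print(cube["model_level_number": range(25, 28)])             # [25, 26, 27]
print(cube["model_level_number": (lambda c: 25 < c <= 28)])  # [26, 27, 28]
```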


One problem that we are trying to work around with all of our proposals is that we don't really have an object that represents a coordinate in the context of the cube that it lives on (quite rightly, coordinates themselves do not hold a (circular) reference back to the cube). There is nothing to say we can't have such a thing though...

class CubeCoordContext(object):
    def __init__(self, cube, coord):
        self.cube = cube
        self.coord = coord

Suppose, for a second, such a thing were returned by cube['coord_name'/coord_instance].
We could add behaviour to such a concept to allow us to do coordinate level operations to the cube.

# Index based subsetting
cube['time'][0: 10]

In reality, this thing could be the object that actually does the mapping of coordinate to data dimension for the cube (so a whole heap of (currently ugly) logic for dimension mapping could be moved into this object).

>>> cube[coord_instance].dims
<the dimensions of the cube that the coordinate maps to>

(Pushing my luck) We might even proxy all attribute access to the coordinate...

cube['time'].points
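A runnable sketch of the combined idea (hypothetical code, assuming a cube that supports coord_dims, ndim and plain indexing, as Iris cubes do; the class body is invented for illustration):

```python
class CubeCoordContext:
    """Binds a cube and one of its coordinates: indexable along that
    coordinate's dimension, with attribute access proxied to the coord."""

    def __init__(self, cube, coord):
        self.cube = cube
        self.coord = coord

    def __getitem__(self, key):
        # Index the cube along the (single) dimension the coord maps to.
        dim, = self.cube.coord_dims(self.coord)
        index = [slice(None)] * self.cube.ndim
        index[dim] = key
        return self.cube[tuple(index)]

    def __getattr__(self, name):
        # e.g. cube['time'].points falls through to the coordinate.
        return getattr(self.coord, name)
```

So cube['time'][0:10] would build a full-cube index with the slice placed on the time dimension, and cube['time'].points would proxy straight to the coordinate.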

@rhattersley (Member, Author)

Syntactically, there is no reason to add magic indexing properties - we are able to express this uniquely through square bracket form alone.

If I've read your suggestion right, you're proposing to only support two of the four indexing variants:

  1. implicit dimension with indexing by position,
  2. and named coordinate with indexing by value.

In which case, yes, you can get away without needing indexing helper objects such as loc or iloc.

@pelson (Member) commented Feb 1, 2017

If I've read your suggestion right, you're proposing to only support two of the four indexing variants

Within __getitem__ on the cube ([...]) that is my suggestion. For index-based access to named coordinates (e.g. get the first time) my proposal adds a Cube+Coordinate object that can be indexed, cube['time'][0: 10]; the result would be a cube (not a coordinate) that represents the first 10 time values.

The object proposed would be very similar to the iloc object that would be necessary in the alternative proposal, except this version would be bound to a single coordinate (meaning you can only do coordinate-based indexing one coordinate at a time). This has the advantage of allowing other cube+coordinate behaviour, such as being the canonical place to map dimensions.

@pp-mo (Member) commented Feb 8, 2017

Not sure quite how relevant this is to the true thrust of this discussion, but ...

I played around with Python access syntax with special methods
-- i.e. __getitem__, __call__, slice syntax and comparison operators --
to see what I could achieve to deliver some neater and more usable cube indexing strategies.
The result is a working solution that uses the existing constraint mechanism (but could easily be refactored).
See : pp-mo#40

Headline : this enables expressions like ...

  • one-sided value range : cube.sel(longitude=CC < 47.5)
  • two-sided value range : cube.sel(latitude=CC[-15:45])
  • get point nearest a target value : cube.sel(longitude=CC(13.7))
    • or CC.nearest(x), or CC ^ x (not sure about this one.)

I thought I was going to wind up passing a dictionary to [], but instead I found myself wanting a function interface to take coordinate names as keys -- as iris.Constraint() does. So I added a "cube.sel" function (originally a different name, but I think usage parallels xarray/pandas sel).
For this, it was then also useful to create a 'helper' object that when called or indexed produces an object representing a desired indexing (as opposed to "a part of" or "a value of" the notional object itself).

One thing that emerged from this exercise is the idea of a convenient operation to select a single point by value.
I think this is important to get around problems with inexact floating-point equality. It could be implemented differently though, perhaps with reference to bounds or point spacing -- the solution in the above PR is possibly too lax.
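As an illustration of the floating-point issue, a nearest-point lookup (a hypothetical helper, not the implementation in the PR above) avoids exact equality entirely:

```python
import numpy as np

def nearest_index(points, target):
    """Index of the coordinate point closest to `target`."""
    points = np.asarray(points, dtype=float)
    return int(np.argmin(np.abs(points - target)))

# Assumed sample grid: a 3.75-degree longitude coordinate.
longitudes = np.arange(0.0, 360.0, 3.75)

# An exact-equality test finds nothing for 13.7, but nearest-point
# selection returns the 15.0 cell (index 4).
i = nearest_index(longitudes, 13.7)
print(i, longitudes[i])  # 4 15.0
```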

@pp-mo (Member) commented Feb 8, 2017

A couple of points I'd like more clarity on:

(1)
I wouldn't defend the implementation of the existing Iris constraint mechanism, but I think the functionality may have some extra value: if I understand the pandas/xarray "sel" properly, the concept of a label is like a dimension, or "dimension coordinate" (?). It uses this exclusively as the indexing identity, whereas Iris can select by any coord.
So, does this mean that an xarray sel[from:to] is always equivalent to a slice, because the reference values are always monotonic?
From a purely practical point of view, in our experience the identity of "the" dim coord can often be arbitrary (as between 'model_level_number' / 'height' or 'time' / 'forecast_period'), and it seems very desirable to retain the freedom to select on the coord of choice (aux or dim). The selection of non-contiguous points may also have value, but I don't quite know how that works in practice.
Have I misunderstood some of this?

(2)
I don't see a clear benefit for the 'loc' and 'iloc' styles for Iris -- i.e. the 'normal' index-position identification of dimensions, but using coordinate values in place of indices.
Would it not be better to simply always use coordinates to identify indexing dimensions?
What is the downside?

Ideas @pelson @rhattersley ?

@rhattersley (Member, Author)

I don't see a clear benefit for the 'loc' and 'iloc' styles for Iris

Pragmatic brevity?

@bjlittle (Member)

I'm going to merge this PR, after it has been hanging around for over 3 years.

I'm keen to bank the content, which is just documentation, and merging is no commitment to implementing this proposal - it's just a proposal.

7 participants