RFC: indexing with multi-dimensional integer arrays #669

shoyer · 2023-07-31T04:05:39Z

I'd like to revisit the discussion from #177 about adding support for indexing with arrays of indices, because this is by far the largest missing gap in functionality required for Xarray.

My specific proposal would be to standardize indexing with a tuple of length equal to the number of array dimensions, where each element in the indexing tuple is either (1) an integer or (2) an integer array. This avoids all the messy edge cases in NumPy related to mixing slices & arrays.

The last time we talked about it, integer indexing via __getitem__ somehow got held up on discussions of mutability with __setitem__. Mutability would be worth resolving at some point, for sure, but IMO is a rather different topics.

The text was updated successfully, but these errors were encountered:

leofang · 2023-07-31T04:42:08Z

where each indexer is either (1) an integer or (2) an integer array.

What's an "indexer"? Each element in the tuple? Or the entire tuple?

seberg · 2023-07-31T08:17:23Z

Do you suggest a dedicated function for this? Or is the oindex vs normal advanced indexing behavior not a problem?

shoyer · 2023-07-31T15:40:43Z

where each indexer is either (1) an integer or (2) an integer array.

What's an "indexer"? Each element in the tuple? Or the entire tuple?

Clarified -- each element in the tuple should be an integer or integer array.

shoyer · 2023-07-31T15:42:16Z

Do you suggest a dedicated function for this? Or is the oindex vs normal advanced indexing behavior not a problem?

I think we should standardize on a subset of vectorized indexing which is common to both vindex and standard NumPy advanced indexing.

oindex is certainly occasionally useful, but vectorized indexing can substitute -- and isn't too much harder to implement once you understand how it works.

seberg · 2023-07-31T16:09:11Z

Sorry, I was wondering if any of the existing array libs chose to use oindex as the default for __getitem__, which case standardizing __getitem__() would be difficult.

asmeurer · 2023-07-31T20:18:37Z

Was there ever an analysis done of what integer array indexing the different array libraries support? I don't think it would show up in the existing API comparison data because that data only looks at function definitions.

kgryte · 2023-08-01T05:15:18Z

Was there ever an analysis done of what integer array indexing the different array libraries support?

No, not in explicit detail.

kgryte · 2023-10-19T08:28:51Z

Based on the history of NEP 21 (https://numpy.org/neps/nep-0021-advanced-indexing.html) and the following related discussions

Discussion: https://mail.python.org/pipermail/numpy-discussion/2015-April/072550.html (the first half of the thread is most relevant)
Discussion: NEP: Add proposal for oindex and vindex. numpy/numpy#6256
Discussion: DOC: major revision of NEP 21, advanced indexing numpy/numpy#11414
Orthogonal indexing: WIP: Add an orthogonally indexable attribute to ndarray numpy/numpy#5749

I wonder if we could standardize functional APIs which cater to the different indexing "flavors" instead of mandating a special form of advanced square bracket indexing. While bracket syntax is convenient in end-user code (e.g., in scripting and the REPL), functional APIs could be more readily leveraged in library code where typing extra characters is, IMO, less of an ergonomic concern. Functional APIs for retrieving elements would also avoid the mutability discussion (ref: #177).

There's precedent for such functional APIs in other languages (e.g., Julia), and one can argue that functional APIs would make intent explicit and avoid ambiguity across array libraries which should be allowed to, e.g., support "orthogonal indexing" (as in MATLAB) or some variant of NumPy's advanced vectorized indexing. If we standardized even a minimal version of NumPy's vectorized indexing semantics via bracket syntax, given conflicting semantics, this might preclude otherwise compliant array libraries from choosing to support MATLAB/Julia style indexing semantics.

Instead, I could imagine something like

def coordinate_index(x: array, *index_args: Union[int, Sequence[int, ...], array]) -> array

where index_args must be integers, a sequence of integers, and/or one-dimensional arrays of integers and the number of index_args must match of the rank of x. Similar to existing NumPy semantics, the index arguments can broadcast to a common length. This is effectively what @shoyer describes in the OP. I use coordinate_index as this is a bit more descriptive than vindex (vectorized index, as described in NEP 21) and has somewhat of a precedent in, e.g., Zarr (get_coordinate_selection). When "zipping" integer arrays, one is effectively describing index coordinates ((0,0), (2,1), ...).

I think it is also worth including orthogonal_index, given its usage in MATLAB, R, Fortran, Julia, and elsewhere. Similarly, we'd define

def orthogonal_index(x: array, *index_args: Union[int, Sequence[int, ...], array]) -> array

where index_args must again be integers, a sequence of integers, and/or one-dimensional arrays of integers, with len(index_args) == rank(x). Similar to oindex in NEP 21, each index argument would independently index the dimensions of x, matching the existing specification semantics of multi-axis indexing.

The biggest omission in the above is the absence of slice support and much of the convenience associated with ellipsis and colons. I think this can be addressed by modifying the above APIs to accept an axes kwarg.

def coordinate_index(x: array, *index_args: Union[int, Sequence[int, ...], array], axes: Optional[Sequence[int]] = None) -> array

def orthogonal_index(x: array, *index_args: Union[int, Sequence[int, ...], array], axes: Optional[Sequence[int]] = None) -> array

When axes is None, the number of index arguments must match the rank of x. When axes is an integer sequence, the number of index arguments must match the number of provided axes. For omitted axes, this would be the equivalent of the colon operator (i.e., an integer sequence specifying all elements along a dimension). This design is a generalized extension of Julia's selectdim API and, for that matter, the current take specification.

Optionally, instead of variadic interfaces, one could do something like

def coordinate_index(x: array, index_args: List[Union[int, Sequence[int, ...]], array], /, *, axes: Optional[Sequence[int]] = None) -> array

def orthogonal_index(x: array, index_args: List[Union[int, Sequence[int, ...]], array], /, *, axes: Optional[Sequence[int]] = None) -> array

where index_args must be a list of index arguments.

While the axes kwarg gets us quite far in terms of ergonomics, we still would lack a bit of the power (and problems) of NumPy's advanced "mixed" indexing semantics. For example, one would not be able to provide a slice which reverses and skips every other element, nor would one be able to use newaxis to readily insert new dimensions. However, should this sort of mixed indexing be desired, we could, potentially, either extend the above APIs to allow explicit slice objects or standardize new APIs supporting more generalized slicing. As it is, especially for arrays supporting views, the omission of generalized slicing is, at worst, an inconvenience.

>>> y = xp.orthogonal_index(x, [0], [0,1], [1,1], [2], axes=(1,3,4,6))
>>> z = y[::-1,...,xp.newaxis,::-2,:]

Another possible future extension is the support of integer index arrays having more than one dimension. As in Julia, the effect could be creation of new dimensions (i.e., the rank of the output array would be the sum of the ranks of the index arguments minus any reduced dimensions).

In short, my sense is that standardizing indexing semantics in functional APIs gives us a bit more flexibility in terms of delineating behavior and more readily incrementally evolving standardized behavior and avoids some of the roadblocks encountered with the adoption of NEP 21 and discussions around backward compatibility.

Addendum

There are also various scatter/gather APIs in PyTorch, JAX, and TensorFlow; however, there is a decent amount of divergence in terms of kwargs, etc, and IMO additional complexity beyond what we are trying to do here.

vnmabus · 2023-10-19T09:20:41Z

Sorry, maybe this is answered elsewhere but, why not making coordinate_index and orthogonal_index properties/functions returning an object with a __getitem__ method, so that you can use slices and ellipsis? E.g.:

As a property:

x.coordinate_index[[1, 3, 4], :, [7, 1, 2], ...]

As a function:

xp.coordinate_index(x)[[1, 3, 4], :, [7, 1, 2], ...]

kgryte · 2023-10-19T09:25:41Z

@vnmabus That is essentially NEP 21. Not opposed, but also that proposal was written in 2015 and still draft. There the naming conventions were oindex and vindex.

In general, in the standard, we've tried to ensure a minimal array object and moved most logic to functions.

seberg · 2023-10-19T09:31:14Z

I will note I liked orthogonal back in the day, but outer hints to the same concept and might already be a word that is being used elsewhere.

shoyer · 2023-10-20T07:47:07Z

Just a brief ntoe: NEP 21 never was implemented in NumPy, butoindex and vindex were adopted elsewhere in the Python array ecosystem, by at least Xarray, Dask, Zarr-Python and TensorStore.

I still think the minimal option of adding support for all integer arrays like NumPy in __getitem__ would probably be enough for the Array API In practice, libraries would implement this as a combination of two operations:

Broadcast indexes against each other
Coordinate based indexing.

Even libraries that don't currently support array-based indexing (like TensorFlow) could add this pretty easily as long as they have an underlying primitive for coordinate based indexing.

There are lots of variations of vectorized/orthogonal indexing, but ultimately if zip/coordinate based indexing is available, that's enough to express pretty much all desired operations -- everything else is just sugar.

seberg · 2023-10-20T08:04:04Z

The only comment I have is that I do not like the idea of having such functions in the main namespace if we think that .oindex is the obvious choice in NumPy.
I don't believe that functions make implementation meaningfully easier for NumPy. The subclass problem is pretty much identical (you either do something sensible or refuse to do it).
The apparent convenience of being able to implement it in Python isn't any harder for a method, if you cut the corner of integrating it deeper.

Now, I don't mind having the functions. I just think NumPy should have one obvious solution and that should probably be .oindex. But then, I still don't see a problem with mild difference between the NumPy main and __array_namespace__ namespaces.

EDIT: To be clear, happy to be convinced that this is useful to NumPy on its own, beyond just adding another way to do the same thing.

asmeurer · 2023-10-20T20:06:28Z

What do you mean by "the subclass problem"?

I think the functional suggestion was done for primarily two reasons:

It doesn't force one type of indexing into the main array.__getitem__.
Using __getitem__ also implies that __setitem__ will be defined, but that's more complicated.

My personal view is that it's fine to make __getitem__ default to the NumPy "coordinate" semantics. And also, despite the difficulties with __setitem__, it seems to be something that people actually need, so we will eventually have to standardize it at least optionally.

In fact, my general impression has been that standardizing __setitem__ or at least some functional equivalent is more important than advanced integer array indexing ("advanced" meaning more advanced than take). It would be good to get some feedback from scipy, scikit-learn, and others on what their needs are.

rgommers · 2024-06-27T19:24:11Z

We discussed this topic in a call today, and concluded that given the number of incoming links and demand for it, it's time to move ahead with allowing indexing with a combination of integers and integer arrays. It seems like there is enough demand and consensus that this is useful, and it looks like there are no serious problems as long as it's integer-only (combining with slices, ellipses, and boolean arrays is still not supported).

A few more outcomes and follow-up actions:

It is necessary to verify that (a) this is supported or supportable with not too much effort by all known array libraries, and (b) it needs support via array-api-tests to ensure we can actually verify the PR with the spec for this without finalizing it.
The two main concerns (coordinate vs. outer indexing, and __getitem__ vs. __setitem__) didn't seem like blockers:
- all known array libraries have numpy-style behavior, the "some new library may want to implement outer indexing in __getitem__ seems hypothetical and unlikely to occur in the future
- __getitem__ and __setitem__ can be treated independently; we have larger fish to fry with __setitem__ still (mutability), and if we do get that done in the future, all-integer indexing should be supportable as well (and if not, that's still not a real problem)
the idea of either a new functional API for this or extending take was not received very enthusiastically, because (a) it'd be a lot more work, (b) it's less readable, and (c) it wouldn't as easily be extensible further (square bracket notation allows slices)

asmeurer · 2024-06-27T19:49:09Z

Some other details from the meeting:

The primary thing that should be supported is integers + integer arrays. Mixing other index types with integer arrays such as slices, newaxes, ellipses, and boolean arrays should not be supported.
We should support the NumPy behavior where all integer and integer arrays are broadcasted together first. This allows outer indexing, and @shoyer pointed out that it's often easier for libraries to make optimizations if index arrays are left unbroadcasted by the user. However, we should verify that these semantics are fully supported by existing libraries.
We should consider allowing an ellipsis to precede all integer arrays (like x[..., 0, arr]). We should determine if this is well supported among existing libraries.

We can always implement the minimal semantics now and expand what else is supported later. For instance, potentially in the future we could allow ellipses, slices, and newaxes to be mixed with integer arrays as long as they are all either at the start or end of the index (the confusing case in NumPy is when integer arrays are on either side of slices, which I think everyone agrees we should never try to standardize). Slices in particular might be important to support as that allows inserting empty slices in the index to select over a particular axis (like x[:, :, idx]).

@shoyer pointed out that a downside to implementing only a subset of NumPy behavior is that it isn't obvious to users what indexing semantics are and aren't part of the standard, but 1) this is already the case (for instance, the standard only allows a single boolean array index, and the standard leaves out-of-bounds integer and slices unspecified), and 2) users can check this using array-api-strict, which errors on unspecified indexing behavior.

asmeurer · 2024-06-27T20:04:52Z

Also it's probably worth pointing out: if any behavior isn't implemented in a library, we can't easily work around it in the compat library (the best we could do would be to add a helper function). So we should take that into consideration if anything actually isn't implemented somewhere.

jni · 2024-06-28T01:29:44Z

For instance, potentially in the future we could allow ellipses, slices, and newaxes to be mixed with integer arrays as long as they are all either at the start or end of the index

👍 This is important for a bunch of my use cases (e.g. image[rr, cc] = [r, g, b], silently equivalent to image[rr, cc, :] = [r, g, b]) but I agree that a staged roll-out is a good plan. Excited to see progress here! 😊🚀🙏

asmeurer · 2024-07-18T21:30:53Z

Another question to consider for __setitem__ is whether setting an array with a different dtype should be allowed. NumPy allows it but PyTorch does not:

>>> a = np.array([0])
>>> b = np.array([1], dtype=np.int32)
>>> a[np.array([0])] = b

>>> a = torch.tensor([0])
>>> b = torch.tensor([1], dtype=torch.int32)
>>> a[torch.tensor([0])] = b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Index put requires the source and destination dtypes match, got Long for the destination and Int for the source.

(curiously, PyTorch does allow this sort of thing in other indexing contexts. For instance, a[:] = b works, as does a[0] = b[0]).

shoyer · 2024-07-18T22:08:06Z

NumPy allows assignment of an array with any dtype, which is clearly broken:

import numpy as np
x = np.zeros(2, dtype=np.int32)
x[:] = np.array(np.nan)

This results in an array with garbage values, and a warning: RuntimeWarning: invalid value encountered in cast

If we allow casting, it should definitely be restricted to "safe" casting (e.g., assigning a strictly smaller dtype). The conservative choice would only be to allow matching dtypes or compatible Python scalar types.

shoyer mentioned this issue Aug 14, 2023

Compatibility with the Array API standard pydata/xarray#7848

Open

TomNicholas mentioned this issue Oct 16, 2023

NamedArray tracking issue pydata/xarray#8238

Open

20 tasks

kgryte added the API extension Adds new functions or objects to the API. label Oct 19, 2023

havogt mentioned this issue Jan 5, 2024

feat[next]: as_offset implementation in embedded GridTools/gt4py#1397

Open

albert-mathews mentioned this issue Jan 22, 2024

RFC: add support for JAX's at[].set() syntax #734

Closed

kgryte added RFC Request for comments. Feature requests and proposed changes. Needs Discussion Needs further discussion. topic: Indexing Array indexing. labels Apr 4, 2024

kgryte changed the title ~~Indexing with multi-dimensional integer arrays~~ RFC: indexing with multi-dimensional integer arrays Apr 4, 2024

kgryte mentioned this issue Apr 4, 2024

RFC: specify multi-indexing behavior? #410

Closed

NeilGirdhar mentioned this issue Apr 10, 2024

ENH: add one-hot special function or support broadcasting in signal.unit_impulse scipy/scipy#20442

Closed

shoyer mentioned this issue May 3, 2024

Proposal: add APIs for getting and setting elements via a list of indices (i.e., take, put, etc) #177

Open

kgryte mentioned this issue May 30, 2024

RFC: add take_along_axis to take values along a specified dimension #808

Closed

shoyer mentioned this issue Oct 30, 2024

Regression: "TypeError: Vectorized indexing is not supported" with xarray 2024.10.0 + sparse pydata/xarray#9694

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: indexing with multi-dimensional integer arrays #669

RFC: indexing with multi-dimensional integer arrays #669

shoyer commented Jul 31, 2023 •

edited

Loading

leofang commented Jul 31, 2023

seberg commented Jul 31, 2023

shoyer commented Jul 31, 2023

shoyer commented Jul 31, 2023

seberg commented Jul 31, 2023

asmeurer commented Jul 31, 2023

kgryte commented Aug 1, 2023

kgryte commented Oct 19, 2023 •

edited

Loading

vnmabus commented Oct 19, 2023

kgryte commented Oct 19, 2023

seberg commented Oct 19, 2023

shoyer commented Oct 20, 2023

seberg commented Oct 20, 2023 •

edited

Loading

asmeurer commented Oct 20, 2023

rgommers commented Jun 27, 2024

asmeurer commented Jun 27, 2024

asmeurer commented Jun 27, 2024

jni commented Jun 28, 2024 •

edited

Loading

asmeurer commented Jul 18, 2024

shoyer commented Jul 18, 2024

RFC: indexing with multi-dimensional integer arrays #669

RFC: indexing with multi-dimensional integer arrays #669

Comments

shoyer commented Jul 31, 2023 • edited Loading

leofang commented Jul 31, 2023

seberg commented Jul 31, 2023

shoyer commented Jul 31, 2023

shoyer commented Jul 31, 2023

seberg commented Jul 31, 2023

asmeurer commented Jul 31, 2023

kgryte commented Aug 1, 2023

kgryte commented Oct 19, 2023 • edited Loading

Addendum

vnmabus commented Oct 19, 2023

kgryte commented Oct 19, 2023

seberg commented Oct 19, 2023

shoyer commented Oct 20, 2023

seberg commented Oct 20, 2023 • edited Loading

asmeurer commented Oct 20, 2023

rgommers commented Jun 27, 2024

asmeurer commented Jun 27, 2024

asmeurer commented Jun 27, 2024

jni commented Jun 28, 2024 • edited Loading

asmeurer commented Jul 18, 2024

shoyer commented Jul 18, 2024

shoyer commented Jul 31, 2023 •

edited

Loading

kgryte commented Oct 19, 2023 •

edited

Loading

seberg commented Oct 20, 2023 •

edited

Loading

jni commented Jun 28, 2024 •

edited

Loading