-
Notifications
You must be signed in to change notification settings - Fork 45
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
partial adoption of Array API for Xarray's NamedArray
#698
Comments
I think the main thing to consider is how usable this would be for consumers of the array API. If you don't accept numbered axes, then any library written against the array API that has code using numbered axes won't actually work with xarray. |
I planned to use NamedArray just to keep track of dimension names across operations (so I do not have to implement that logic myself), but I needed it to be compatible with the array API standard. It is ok for me if you compare dimension names for some operations (such as attempting to sum different things) and raise an exception in that case. It ideally should also support operations between NamedArrays and the underlying wrapped (unnamed) array type. Would that not be possible? |
Ccing @shoyer / @Illviljan / @TomNicholas / @dcherian for visibility |
@vnmabus I may have misunderstood you but my understanding is that these are two conflicting requirements. Compatibility with the full API standard implies broadcasting by axis number, including in binary operations like sum. But if we compare dimension names to use those for broadcasting during a sum operation then that is a different behaviour, which is inconsistent with broadcasting by axis number. To see what I mean consider summing two arrays, which are identical up to a transposition. In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: arr = np.array([[1, 2, 3], [4, 5, 6]])
In [4]: arr.shape
Out[4]: (2, 3)
In [5]: arr.T
Out[5]:
array([[1, 4],
[2, 5],
[3, 6]])
In [6]: arr.T.shape
Out[6]: (3, 2)
In [7]: arr + arr.T
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 arr + arr.T
ValueError: operands could not be broadcast together with shapes (2,3) (3,2)
In [8]: var = xr.Variable(data=arr, dims=['x', 'y'])
In [9]: var.shape
Out[9]: (2, 3)
In [10]: var.T
Out[10]:
<xarray.Variable (y: 3, x: 2)>
array([[1, 4],
[2, 5],
[3, 6]])
In [11]: var.T.shape
Out[11]: (3, 2)
In [12]: var + var.T
Out[12]:
<xarray.Variable (x: 2, y: 3)>
array([[ 2, 4, 6],
[ 8, 10, 12]]) You can see that in numpy (broadcasting by axis number) then the array shapes |
Now perhaps we can re-explain @andersy005 's question above: operations on scalars are the only part of the array API where axis number & broadcasting rules are irrelevant, so it's the only part of the array API standard that xarray could obey. The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it? |
Finally you could imagine allowing this, but it would create a weird situation where xarray-only code was robust to transposition, but any operations between xarray and numpy objects were not invariant to transposition. |
Well, I was not aware of that behavior of xarray, and for me it seems dangerous, and not very consistent. What if I wanted to divide, for example, an array with a "distance" dimension and one with a "time" dimension. Would xarray do it or it would refuse as the dimensions do not match? What if the arrays have additional dimensions with similar names? For me that is not a desirable behavior. What I like is simply an array type (supporting the full array API standard) that wraps other array types and adds named dimensions, that work across operations. It could be ok if they check that the dim name is the same for some operations, like adding two arrays, although even in that case it seems dubious to me (what if I wanted to sum dims "initialization time" and "computation time"? Honestly, I think that check should be done at the unit level and not at the name level). |
If you divide an array with a "distance" dimension by an array with a "time" dimension in Xarray, you get an array with two dimensions: "distance" and "time". Some other operations might will raise an error with different dimension names, but in no cases will dimensions names be ignored.
I agree, this is certainly a reasonable data structure to build, but it isn't Xarray. This is basically the model used by PyTorch's Named Tensor. |
I guess it's worth considering it just in the sense that the function names are standardized so it's useful to use them for the sake of consistency across libraries (this even for functions that otherwise have an It's likely that most nontrivial code written against the array API won't work with named arrays but perhaps there is some code that will. Perhaps it would be instructive to take a look at scipy and scikit-learn's array API code and see whether it would otherwise work with xarray. |
I could interpret this question in two ways:
I think the answer to (1) is yes, why not - a lot of effort went into making every function in the standard have a clean signature and well-defined semantics, and comparing with many array libraries to make sure it matches them as well as possible (unless there's a good reason not to). The answer to (2) should be no I believe. That will be incorrect, and for libraries that try to write array-library-agnostic code it may lead to silently incorrect results. |
We could raise for every op that is not an elementwise function but that seems against the spirit of the standard. Would you agree? |
Probably against the letter too, and not just the spirit - since there's many places where the wording is "must behave like X". |
Point taken! 😂 Thank you |
As I don't think there is anything more to discuss here as pertains to revising the array API standard, I'll go ahead and close this. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
we've been having a discussion over in pydata/xarray#8333 about adopting Array API and the possibility of offering an array-API compatible interface via
namedarray
, a new Xarray data structure that aims to be a lighter version of xarray's Variable: pydata/xarray#8080although there's general agreement that
xarray.NamedArray
should not offer an array-API compatible interface --mainly due to Xarray's unique semantics like broadcasting by dimension name rather than dimension order (e.g.xp.mean(xarray.NamedArray, axis=1)
vsxarray.NamedArray.mean(dim="time")
--we're keen on adopting element-wise functions and how they would interact with xarray.the most important point at issue is, we find that these element-wise functions like
xp.abs(xarray.NamedArray) --> xarray.NamedArray
are critical to our use-case. however, we're not looking to adopt the entire Array API standard, but are there any guidelines or recommendations for adopting a subset of it? we thought it's worth raising this issue, to see if folks here think this is a good idea or not.The text was updated successfully, but these errors were encountered: