-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only use CopyOnWriteArray wrapper on BackendArrays #8712
base: main
Are you sure you want to change the base?
Only use CopyOnWriteArray wrapper on BackendArrays #8712
Conversation
…omNicholas/xarray into only-copyonwrite-backendarrays
for more information, see https://pre-commit.ci
Okay I realise now that checking if the backend returned an array that subclassed |
Hmm this seems to be conceptually similar to the Ultimately I'm trying to deal with this error: ---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[73], line 1
----> 1 result = xr.Variable.concat([ds1_kc['air'].variable, ds2_kc['air'].variable], dim='time')
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2103, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
2101 axis = first_var.get_axis_num(dim)
2102 dims = first_var_dims
-> 2103 data = duck_array_ops.concatenate(arrays, axis=axis)
2104 if positions is not None:
2105 # TODO: deprecate this option -- we don't need it for groupby
2106 # any more.
2107 indices = nputils.inverse_permutation(np.concatenate(positions))
File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:326, in concatenate(arrays, axis)
324 xp = get_array_namespace(arrays[0])
325 return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 326 return _concatenate(as_shared_dtype(arrays), axis=axis)
File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:205, in as_shared_dtype(scalars_or_arrays, xp)
203 arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
204 else:
--> 205 arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
206 # Pass arrays directly instead of dtypes to result_type so scalars
207 # get handled properly.
208 # Note that result_type() safely gets the dtype from dask arrays without
209 # evaluating them.
210 out_type = dtypes.result_type(*arrays)
File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:192, in asarray(data, xp)
191 def asarray(data, xp=np):
--> 192 return data if is_duck_array(data) else xp.asarray(data)
File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:471, in ExplicitlyIndexedNDArrayMixin.__array__(self, dtype)
468 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
469 # This is necessary because we apply the indexing key in self.get_duck_array()
470 # Note this is the base class for all lazy indexing classes
--> 471 return np.asarray(self.get_duck_array(), dtype=dtype)
Cell In[28], line 99, in KerchunkArray.__array__(self)
98 def __array__(self) -> np.ndarray:
---> 99 raise NotImplementedError("KerchunkArrays can't be converted into numpy arrays or pandas Index objects")
NotImplementedError: KerchunkArrays can't be converted into numpy arrays or pandas Index objects The reason it fails the |
I realized that the actual problem is similar to that in #4231, but I need a solution that never calls |
a few months ago, I created keewis/nested-duck-arrays to tackle the problem in #4231. Do you think the idea there would be applicable here, as well? |
Huh, yeah I guess that is the same as this problem! If we implemented this protocol on all of our lazy indexing classes I could interrogate it to solve this issue. We could also use it instead of the |
@@ -236,9 +236,16 @@ def _protect_dataset_variables_inplace(dataset, cache): | |||
for name, variable in dataset.variables.items(): | |||
if name not in dataset._indexes: | |||
# no need to protect IndexVariable objects | |||
data = indexing.CopyOnWriteArray(variable._data) | |||
|
|||
if isinstance(variable._data, backends.BackendArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not right to me. We are in backends/api.py
these should all be BackendArrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way that BackendArray
is presented in How to add a new backend strongly implies that using BackendArray
is optional. In the KerchunkArray
case I specifically don't want to use BackendArray
, even though I do want to write a backend (from #8714 (comment)):
my KerchunkArray case is interesting because I don't want to use BackendArray (I have no use for CopyOnWrite because I'm never loading bytes, nor for Lazy indexing (I can't index into the KerchunkArray at all).
After some chatting, @TomNicholas and I are in agreement that there is a bug where xarray/xarray/core/indexing.py Line 663 in 0eb6658
There's also #8408 which needs a test. If |
This makes sure we only use the
CopyOnWriteArray
wrapper on arrays that have been explicitly marked to be lazily-loaded (through being subclasses ofBackendArray
). Without this change we are implicitly assuming that any array type obtained through the BackendEntrypoint system should be treated as if it points to an on-disk array.Motivated by #8699, which is a counterexample to that assumption.
whats-new.rst
api.rst