Only use CopyOnWriteArray wrapper on BackendArrays #8712

TomNicholas · 2024-02-06T06:05:53Z

This makes sure we only use the CopyOnWriteArray wrapper on arrays that have been explicitly marked to be lazily-loaded (through being subclasses of BackendArray). Without this change we are implicitly assuming that any array type obtained through the BackendEntrypoint system should be treated as if it points to an on-disk array.

Motivated by #8699, which is a counterexample to that assumption.

Closes #xxxx
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

…omNicholas/xarray into only-copyonwrite-backendarrays

for more information, see https://pre-commit.ci

TomNicholas · 2024-02-06T07:38:26Z

Okay I realise now that checking if the backend returned an array that subclassed BackendArray is not just a matter of a single isinstance check. Instead I would have to keep going down through .array.array.array... checking isinstance at each level 🙁

TomNicholas · 2024-02-06T08:47:58Z

Hmm this seems to be conceptually similar to the get_duck_array @dcherian implemented in #6874. I could add a similar method, but it feels like I'm straying from the problem I actually want to solve.

Ultimately I'm trying to deal with this error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[73], line 1
----> 1 result = xr.Variable.concat([ds1_kc['air'].variable, ds2_kc['air'].variable], dim='time')

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2103, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
   2101 axis = first_var.get_axis_num(dim)
   2102 dims = first_var_dims
-> 2103 data = duck_array_ops.concatenate(arrays, axis=axis)
   2104 if positions is not None:
   2105     # TODO: deprecate this option -- we don't need it for groupby
   2106     # any more.
   2107     indices = nputils.inverse_permutation(np.concatenate(positions))

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:326, in concatenate(arrays, axis)
    324     xp = get_array_namespace(arrays[0])
    325     return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 326 return _concatenate(as_shared_dtype(arrays), axis=axis)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:205, in as_shared_dtype(scalars_or_arrays, xp)
    203     arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
    204 else:
--> 205     arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
    206 # Pass arrays directly instead of dtypes to result_type so scalars
    207 # get handled properly.
    208 # Note that result_type() safely gets the dtype from dask arrays without
    209 # evaluating them.
    210 out_type = dtypes.result_type(*arrays)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:192, in asarray(data, xp)
    191 def asarray(data, xp=np):
--> 192     return data if is_duck_array(data) else xp.asarray(data)

File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:471, in ExplicitlyIndexedNDArrayMixin.__array__(self, dtype)
    468 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
    469     # This is necessary because we apply the indexing key in self.get_duck_array()
    470     # Note this is the base class for all lazy indexing classes
--> 471     return np.asarray(self.get_duck_array(), dtype=dtype)

Cell In[28], line 99, in KerchunkArray.__array__(self)
     98 def __array__(self) -> np.ndarray:
---> 99     raise NotImplementedError("KerchunkArrays can't be converted into numpy arrays or pandas Index objects")

NotImplementedError: KerchunkArrays can't be converted into numpy arrays or pandas Index objects

The reason it fails the is_duck_array(data) check is because at this point it sees a CopyOnWriteArray.

TomNicholas · 2024-02-06T08:49:15Z

I realized that the actual problem is similar to that in #4231, but I need a solution that never calls __array__ on the wrapped duck array type, not just a solution that only works specifically for cupy arrays.

keewis · 2024-02-06T11:16:09Z

a few months ago, I created keewis/nested-duck-arrays to tackle the problem in #4231. Do you think the idea there would be applicable here, as well?

TomNicholas · 2024-02-06T18:03:30Z

a few months ago, I created keewis/nested-duck-arrays to tackle the problem in #4231. Do you think the idea there would be applicable here, as well?

Huh, yeah I guess that is the same as this problem! If we implemented this protocol on all of our lazy indexing classes I could interrogate it to solve this issue. We could also use it instead of the get_duck_array method that @dcherian added in #6874 (it's basically just a more general version of the same idea).

dcherian · 2024-02-06T18:54:55Z

xarray/backends/api.py

@@ -236,9 +236,16 @@ def _protect_dataset_variables_inplace(dataset, cache):
    for name, variable in dataset.variables.items():
        if name not in dataset._indexes:
            # no need to protect IndexVariable objects
-            data = indexing.CopyOnWriteArray(variable._data)
+
+            if isinstance(variable._data, backends.BackendArray):


This seems not right to me. We are in backends/api.py these should all be BackendArrays.

The way that BackendArray is presented in How to add a new backend strongly implies that using BackendArray is optional. In the KerchunkArray case I specifically don't want to use BackendArray, even though I do want to write a backend (from #8714 (comment)):

my KerchunkArray case is interesting because I don't want to use BackendArray (I have no use for CopyOnWrite because I'm never loading bytes, nor for Lazy indexing (I can't index into the KerchunkArray at all).

dcherian · 2024-02-07T17:09:07Z

After some chatting, @TomNicholas and I are in agreement that there is a bug where np.asarray is being called instead of get_duck_array(), perhaps here:

xarray/xarray/core/indexing.py

Line 663 in 0eb6658

self.array = as_indexable(np.array(self.array))

There's also #8408 which needs a test.

If KerchunkArray was a "duck array" that implemented np.concatenate and __array_function__ then a Variable can wrap it, and the concatenation step will dispatch down to KerchunkArray to handle

TomNicholas added 4 commits February 6, 2024 00:41

only wrap BackendArrays in CopyOnWriteArray

ca17266

only wrap BackendArrays in CopyOnWriteArray

99a37be

document how CopyOnWrite happens for BackendArray subclasses

c80fefd

Merge branch 'only-copyonwrite-backendarrays' of https://github.com/T…

3f2b82d

…omNicholas/xarray into only-copyonwrite-backendarrays

TomNicholas added the topic-backends label Feb 6, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

22ec10f

for more information, see https://pre-commit.ci

TomNicholas mentioned this pull request Feb 6, 2024

Avoid coercing to numpy in as_shared_dtypes #8714

Open

4 tasks

TomNicholas added the topic-arrays related to flexible array support label Feb 6, 2024

dcherian reviewed Feb 6, 2024

View reviewed changes

TomNicholas mentioned this pull request Mar 15, 2024

Opening via xarray backendentrypoint zarr-developers/VirtualiZarr#35

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only use CopyOnWriteArray wrapper on BackendArrays #8712

Only use CopyOnWriteArray wrapper on BackendArrays #8712

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

keewis commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024

dcherian Feb 6, 2024

TomNicholas Feb 6, 2024

dcherian commented Feb 7, 2024 •

edited

Loading

Only use CopyOnWriteArray wrapper on BackendArrays #8712

Are you sure you want to change the base?

Only use CopyOnWriteArray wrapper on BackendArrays #8712

Conversation

TomNicholas commented Feb 6, 2024 • edited Loading

TomNicholas commented Feb 6, 2024 • edited Loading

TomNicholas commented Feb 6, 2024 • edited Loading

TomNicholas commented Feb 6, 2024 • edited Loading

keewis commented Feb 6, 2024 • edited Loading

TomNicholas commented Feb 6, 2024

dcherian Feb 6, 2024

Choose a reason for hiding this comment

TomNicholas Feb 6, 2024

Choose a reason for hiding this comment

dcherian commented Feb 7, 2024 • edited Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

TomNicholas commented Feb 6, 2024 •

edited

Loading

keewis commented Feb 6, 2024 •

edited

Loading

dcherian commented Feb 7, 2024 •

edited

Loading