Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only use CopyOnWriteArray wrapper on BackendArrays #8712

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Feb 6, 2024

This makes sure we only use the CopyOnWriteArray wrapper on arrays that have been explicitly marked to be lazily-loaded (through being subclasses of BackendArray). Without this change we are implicitly assuming that any array type obtained through the BackendEntrypoint system should be treated as if it points to an on-disk array.

Motivated by #8699, which is a counterexample to that assumption.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@TomNicholas
Copy link
Member Author

TomNicholas commented Feb 6, 2024

Okay I realise now that checking if the backend returned an array that subclassed BackendArray is not just a matter of a single isinstance check. Instead I would have to keep going down through .array.array.array... checking isinstance at each level 🙁

@TomNicholas
Copy link
Member Author

TomNicholas commented Feb 6, 2024

Hmm this seems to be conceptually similar to the get_duck_array @dcherian implemented in #6874. I could add a similar method, but it feels like I'm straying from the problem I actually want to solve.

Ultimately I'm trying to deal with this error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[73], line 1
----> 1 result = xr.Variable.concat([ds1_kc['air'].variable, ds2_kc['air'].variable], dim='time')

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2103, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
   2101 axis = first_var.get_axis_num(dim)
   2102 dims = first_var_dims
-> 2103 data = duck_array_ops.concatenate(arrays, axis=axis)
   2104 if positions is not None:
   2105     # TODO: deprecate this option -- we don't need it for groupby
   2106     # any more.
   2107     indices = nputils.inverse_permutation(np.concatenate(positions))

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:326, in concatenate(arrays, axis)
    324     xp = get_array_namespace(arrays[0])
    325     return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 326 return _concatenate(as_shared_dtype(arrays), axis=axis)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:205, in as_shared_dtype(scalars_or_arrays, xp)
    203     arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
    204 else:
--> 205     arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
    206 # Pass arrays directly instead of dtypes to result_type so scalars
    207 # get handled properly.
    208 # Note that result_type() safely gets the dtype from dask arrays without
    209 # evaluating them.
    210 out_type = dtypes.result_type(*arrays)

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:192, in asarray(data, xp)
    191 def asarray(data, xp=np):
--> 192     return data if is_duck_array(data) else xp.asarray(data)

File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:471, in ExplicitlyIndexedNDArrayMixin.__array__(self, dtype)
    468 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
    469     # This is necessary because we apply the indexing key in self.get_duck_array()
    470     # Note this is the base class for all lazy indexing classes
--> 471     return np.asarray(self.get_duck_array(), dtype=dtype)

Cell In[28], line 99, in KerchunkArray.__array__(self)
     98 def __array__(self) -> np.ndarray:
---> 99     raise NotImplementedError("KerchunkArrays can't be converted into numpy arrays or pandas Index objects")

NotImplementedError: KerchunkArrays can't be converted into numpy arrays or pandas Index objects

The reason it fails the is_duck_array(data) check is because at this point it sees a CopyOnWriteArray.

@TomNicholas
Copy link
Member Author

TomNicholas commented Feb 6, 2024

I realized that the actual problem is similar to that in #4231, but I need a solution that never calls __array__ on the wrapped duck array type, not just a solution that only works specifically for cupy arrays.

@keewis
Copy link
Collaborator

keewis commented Feb 6, 2024

a few months ago, I created keewis/nested-duck-arrays to tackle the problem in #4231. Do you think the idea there would be applicable here, as well?

@TomNicholas
Copy link
Member Author

a few months ago, I created keewis/nested-duck-arrays to tackle the problem in #4231. Do you think the idea there would be applicable here, as well?

Huh, yeah I guess that is the same as this problem! If we implemented this protocol on all of our lazy indexing classes I could interrogate it to solve this issue. We could also use it instead of the get_duck_array method that @dcherian added in #6874 (it's basically just a more general version of the same idea).

@TomNicholas TomNicholas added the topic-arrays related to flexible array support label Feb 6, 2024
@@ -236,9 +236,16 @@ def _protect_dataset_variables_inplace(dataset, cache):
for name, variable in dataset.variables.items():
if name not in dataset._indexes:
# no need to protect IndexVariable objects
data = indexing.CopyOnWriteArray(variable._data)

if isinstance(variable._data, backends.BackendArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not right to me. We are in backends/api.py these should all be BackendArrays.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that BackendArray is presented in How to add a new backend strongly implies that using BackendArray is optional. In the KerchunkArray case I specifically don't want to use BackendArray, even though I do want to write a backend (from #8714 (comment)):

my KerchunkArray case is interesting because I don't want to use BackendArray (I have no use for CopyOnWrite because I'm never loading bytes, nor for Lazy indexing (I can't index into the KerchunkArray at all).

@dcherian
Copy link
Contributor

dcherian commented Feb 7, 2024

After some chatting, @TomNicholas and I are in agreement that there is a bug where np.asarray is being called instead of get_duck_array(), perhaps here:

self.array = as_indexable(np.array(self.array))

There's also #8408 which needs a test.

If KerchunkArray was a "duck array" that implemented np.concatenate and __array_function__ then a Variable can wrap it, and the concatenation step will dispatch down to KerchunkArray to handle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support topic-backends
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants