Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calling np.asarray on lazy indexing classes #6874

Merged
merged 59 commits into from
Mar 31, 2023
Merged
Show file tree
Hide file tree
Changes from 45 commits
Commits
Show all changes
59 commits
Select commit Hold shift + click to select a range
45cd500
Add get_array to lazy indexing array types.
dcherian Aug 2, 2022
9c0350c
Rename to short_array_repr; use Variable.data
dcherian Aug 2, 2022
74afa53
Fix Variable.load
dcherian Aug 2, 2022
9de7427
Make get_array recursive.
dcherian Aug 2, 2022
cc0a653
Some cleanups
dcherian Aug 2, 2022
59c7ead
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 3, 2022
2aa0830
Add get_array to PandasIndexingAdaptor
dcherian Aug 3, 2022
1306758
Finish short_array_repr refactoring
dcherian Aug 3, 2022
cf67972
Rename to get_duck_array
dcherian Aug 3, 2022
0209900
Try without hasattr check
dcherian Aug 3, 2022
536648a
Return bare array from LazilyIndexedArray.get_duck_array
dcherian Aug 3, 2022
f2514c7
Add get_duck_array to AbstractArray
dcherian Aug 3, 2022
3c597d4
Fix zerodim test
dcherian Aug 3, 2022
201eeba
Fix LazilyVectorizedIndexedArray
dcherian Aug 5, 2022
cd02a8a
Inherit __array__ from ExplicitlyIndexed
dcherian Aug 5, 2022
7ef55e0
Fix InaccessibleArray in tests
dcherian Aug 5, 2022
4e77fec
Fix BackendArray
dcherian Aug 5, 2022
d14c61f
Merge branch 'main' into kvikio
dcherian Aug 6, 2022
19af950
Merge branch 'main' into kvikio
dcherian Aug 9, 2022
22db817
reprs Use .data on AbstractArray
dcherian Aug 10, 2022
2bbcc16
Force netCDF and h5netCDF to return arrays
dcherian Aug 10, 2022
ca2a10a
Add whats-new
dcherian Aug 10, 2022
9256dd0
Merge branch 'main' into kvikio
dcherian Aug 12, 2022
906c3b3
Add comments; review feedback
dcherian Aug 16, 2022
598c201
Merge branch 'main' into kvikio
dcherian Nov 27, 2022
941c643
Fix merge.
dcherian Nov 27, 2022
d1127fe
Remove another np.asarray
dcherian Aug 16, 2022
c0c78a1
Avoid np.asarray on __getitem__.
dcherian Jan 17, 2023
9b727e6
[WIP] ExplicitlyIndexedBackendArray
dcherian Oct 11, 2022
46d98ec
Handle Indexing Adapter classes explicitly.
dcherian Jan 17, 2023
b19a24b
Revert "Handle Indexing Adapter classes explicitly."
dcherian Jan 18, 2023
84f560f
Revert "[WIP] ExplicitlyIndexedBackendArray"
dcherian Jan 18, 2023
426519f
Merge branch 'main' into kvikio
dcherian Jan 18, 2023
c4b81bf
Fix pydap now that NumpyIndexingAdapter does not automatically cast t…
dcherian Jan 19, 2023
d11a3cf
Update xarray/backends/pydap_.py
dcherian Jan 19, 2023
c223617
Add test store.
dcherian Jan 19, 2023
51552d4
Merge branch 'main' into kvikio
dcherian Jan 19, 2023
5f1cf53
[skip-ci] Update whats-new
dcherian Jan 19, 2023
937d572
Fix test
dcherian Jan 19, 2023
cc7d0b5
fix mypy?
dcherian Jan 19, 2023
1576261
Fix Zarr test
dcherian Jan 19, 2023
7d8459e
test the repr too
dcherian Jan 19, 2023
9815b75
Guard np.asarray for scalars.
dcherian Jan 20, 2023
39e7529
Revert casting to arrays in backend
dcherian Jan 24, 2023
f304bcb
Wrap numpy scalars in Explicitly*Indexed*.get_duck_aray
dcherian Jan 24, 2023
6cb1677
Merge branch 'main' into kvikio
dcherian Jan 26, 2023
2c7da96
Merge branch 'main' into kvikio
dcherian Feb 13, 2023
26d224c
Apply suggestions from code review
dcherian Feb 16, 2023
65da209
Update xarray/tests/__init__.py
dcherian Feb 16, 2023
0bc1175
Update xarray/core/indexing.py
dcherian Mar 3, 2023
8c2d74c
Apply suggestions from code review
dcherian Mar 3, 2023
20c8c81
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2023
77f7059
Bring back the ugly check
dcherian Mar 4, 2023
517f195
Merge branch 'main' into kvikio
dcherian Mar 26, 2023
5c23bd2
Update whats-new
dcherian Mar 26, 2023
887e1c5
Fix pre-commit
dcherian Mar 26, 2023
2557d02
silence mypy error
dcherian Mar 26, 2023
b313258
minimize diff
dcherian Mar 26, 2023
cbd030e
Merge branch 'main' into kvikio
dcherian Mar 29, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 5 additions & 3 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,11 @@ Documentation
Internal Changes
~~~~~~~~~~~~~~~~

- Don't assume that arrays read from disk will be Numpy arrays. This is a step toward
enabling reads from a Zarr store using the `Kvikio <https://docs.rapids.ai/api/kvikio/stable/api.html#zarr>`_
or `TensorStore <https://google.github.io/tensorstore/>`_ libraries.
(:pull:`6874`). By `Deepak Cherian <https://github.com/dcherian>`_.


.. _whats-new.2023.01.0:

Expand Down Expand Up @@ -147,9 +152,6 @@ Documentation
- Add example of reading and writing individual groups to a single netCDF file to I/O docs page. (:pull:`7338`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.

Internal Changes
~~~~~~~~~~~~~~~~


.. _whats-new.2022.11.0:

Expand Down
4 changes: 2 additions & 2 deletions xarray/backends/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,9 +83,9 @@ def robust_getitem(array, key, catch=Exception, max_retries=6, initial_delay=500
class BackendArray(NdimSizeLenMixin, indexing.ExplicitlyIndexed):
__slots__ = ()

def __array__(self, dtype=None):
def get_duck_array(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
key = indexing.BasicIndexer((slice(None),) * self.ndim)
return np.asarray(self[key], dtype=dtype)
return self[key]


class AbstractDataStore:
Expand Down
1 change: 1 addition & 0 deletions xarray/backends/pydap_.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ def _getitem(self, key):
# downloading coordinate data twice
array = getattr(self.array, "array", self.array)
result = robust_getitem(array, key, catch=ValueError)
result = np.asarray(result)
Copy link
Contributor Author

@dcherian dcherian Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently return pydap.BaseType and rely on an np.asarray call in explicit_indexing_adapter. I've removed that call below.

# in some cases, pydap doesn't squeeze axes automatically like numpy
axis = tuple(n for n, k in enumerate(key) if isinstance(k, integer_types))
if result.ndim + len(axis) != array.ndim and axis:
Expand Down
6 changes: 3 additions & 3 deletions xarray/coding/variables.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,8 +68,8 @@ def dtype(self) -> np.dtype:
def __getitem__(self, key):
return type(self)(self.array[key], self.func, self.dtype)

def __array__(self, dtype=None):
return self.func(self.array)
def get_duck_array(self):
return self.func(self.array.get_duck_array())

def __repr__(self) -> str:
return "{}({!r}, func={!r}, dtype={!r})".format(
Expand Down Expand Up @@ -223,7 +223,7 @@ def decode(self, variable: Variable, name: T_Name = None):


def _scale_offset_decoding(data, scale_factor, add_offset, dtype: np.typing.DTypeLike):
data = np.array(data, dtype=dtype, copy=True)
data = data.astype(dtype=dtype, copy=True)
if scale_factor is not None:
data *= scale_factor
if add_offset is not None:
Expand Down
19 changes: 13 additions & 6 deletions xarray/core/formatting.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@
from pandas.errors import OutOfBoundsDatetime

from xarray.core.duck_array_ops import array_equiv
from xarray.core.indexing import MemoryCachedArray
from xarray.core.indexing import ExplicitlyIndexed, MemoryCachedArray
from xarray.core.options import OPTIONS, _get_boolean_with_default
from xarray.core.pycompat import array_type
from xarray.core.utils import is_duck_array
Expand Down Expand Up @@ -557,8 +557,15 @@ def limit_lines(string: str, *, limit: int):
return string


def short_numpy_repr(array):
array = np.asarray(array)
def short_array_repr(array):
from xarray.core.common import AbstractArray

if isinstance(array, ExplicitlyIndexed):
array = array.get_duck_array()
elif isinstance(array, AbstractArray):
array = array.data
if not is_duck_array(array):
array = np.asarray(array)

# default to lower precision so a full (abbreviated) line can fit on
# one line with the default display_width
Expand All @@ -582,11 +589,11 @@ def short_data_repr(array):
"""Format "data" for DataArray and Variable."""
internal_data = getattr(array, "variable", array)._data
if isinstance(array, np.ndarray):
return short_numpy_repr(array)
return short_array_repr(array)
elif is_duck_array(internal_data):
return limit_lines(repr(array.data), limit=40)
elif array._in_memory:
return short_numpy_repr(array)
return short_array_repr(array)
else:
# internal xarray array type
return f"[{array.size} values with dtype={array.dtype}]"
Expand Down Expand Up @@ -831,7 +838,7 @@ def diff_array_repr(a, b, compat):
equiv = array_equiv

if not equiv(a.data, b.data):
temp = [wrap_indent(short_numpy_repr(obj), start=" ") for obj in (a, b)]
temp = [wrap_indent(short_array_repr(obj), start=" ") for obj in (a, b)]
diff_data_repr = [
ab_side + "\n" + ab_data_repr
for ab_side, ab_data_repr in zip(("L", "R"), temp)
Expand Down
60 changes: 45 additions & 15 deletions xarray/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -442,13 +442,25 @@ class ExplicitlyIndexed:

__slots__ = ()

def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
# Leave casting to an array up to the underlying array type.
return np.asarray(self.get_duck_array(), dtype=dtype)

def get_duck_array(self):
return self.array


class ExplicitlyIndexedNDArrayMixin(NDArrayMixin, ExplicitlyIndexed):
__slots__ = ()

def __array__(self, dtype=None):
def get_duck_array(self):
key = BasicIndexer((slice(None),) * self.ndim)
return np.asarray(self[key], dtype=dtype)
return self[key]

def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
# This is necessary because we apply the indexing key in self.get_duck_array()
# Note this is the base class for all lazy indexing classes
return np.asarray(self.get_duck_array(), dtype=dtype)


class ImplicitToExplicitIndexingAdapter(NDArrayMixin):
Expand All @@ -461,7 +473,10 @@ def __init__(self, array, indexer_cls=BasicIndexer):
self.indexer_cls = indexer_cls

def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
return np.asarray(self.array, dtype=dtype)
return np.asarray(self.get_duck_array(), dtype=dtype)

def get_duck_array(self):
return self.array.get_duck_array()

def __getitem__(self, key):
key = expanded_indexer(key, self.ndim)
Expand Down Expand Up @@ -524,9 +539,16 @@ def shape(self) -> tuple[int, ...]:
shape.append(k.size)
return tuple(shape)

def __array__(self, dtype=None):
def get_duck_array(self):
array = as_indexable(self.array)
return np.asarray(array[self.key], dtype=None)
array = array[self.key]
# array[self.key] is now a numpy array when
# self.array is a BackendArray subclass
# and self.key is BasicIndexer((slice(None, None, None),))
# so we need the explicit check for ExplicitlyIndexed
if isinstance(array, ExplicitlyIndexed):
array = array.get_duck_array()
dcherian marked this conversation as resolved.
Show resolved Hide resolved
return _wrap_numpy_scalars(array)
Copy link
Contributor Author

@dcherian dcherian Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding _wrap_numpy_scalars allows us to handle scalars being returned by the backend. This seems OK to me in that we place fewer restrictions on the backend (and is backward compatible).

def _wrap_numpy_scalars(array):
"""Wrap NumPy scalars in 0d arrays."""
if np.isscalar(array):
return np.array(array)
else:
return array

But now the issue is that we should pass an appropriate like argument to np.array but I don't see how to that from a scalar array

Good news is that backends can avoid this complication by returning arrays, so we could just ignore this ugly bit for now.


def transpose(self, order):
return LazilyVectorizedIndexedArray(self.array, self.key).transpose(order)
Expand Down Expand Up @@ -577,8 +599,11 @@ def __init__(self, array, key):
def shape(self) -> tuple[int, ...]:
return np.broadcast(*self.key.tuple).shape

def __array__(self, dtype=None):
return np.asarray(self.array[self.key], dtype=None)
def get_duck_array(self):
array = self.array[self.key]
if isinstance(array, ExplicitlyIndexed):
array = array.get_duck_array()
return _wrap_numpy_scalars(array)

def _updated_key(self, new_key):
return _combine_indexers(self.key, self.shape, new_key)
Expand Down Expand Up @@ -624,8 +649,8 @@ def _ensure_copied(self):
self.array = as_indexable(np.array(self.array))
self._copied = True

def __array__(self, dtype=None):
return np.asarray(self.array, dtype=dtype)
def get_duck_array(self):
return self.array.get_duck_array()

def __getitem__(self, key):
return type(self)(_wrap_numpy_scalars(self.array[key]))
Expand All @@ -651,12 +676,14 @@ def __init__(self, array):
self.array = _wrap_numpy_scalars(as_indexable(array))

def _ensure_cached(self):
if not isinstance(self.array, NumpyIndexingAdapter):
self.array = NumpyIndexingAdapter(np.asarray(self.array))
self.array = as_indexable(self.array.get_duck_array())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right generalization. as_indexable should choose an appropriate indexing adapter.


def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
return np.asarray(self.get_duck_array(), dtype=dtype)

def get_duck_array(self):
self._ensure_cached()
return np.asarray(self.array, dtype=dtype)
return self.array.get_duck_array()

def __getitem__(self, key):
return type(self)(_wrap_numpy_scalars(self.array[key]))
Expand Down Expand Up @@ -820,7 +847,7 @@ def explicit_indexing_adapter(
result = raw_indexing_method(raw_key.tuple)
if numpy_indices.tuple:
# index the loaded np.ndarray
result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices]
result = NumpyIndexingAdapter(result)[numpy_indices]
Copy link
Contributor Author

@dcherian dcherian Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure about this but it forces us to explicitly pass numpy arrays to NumpyIndexingAdapter. Only the pydap test failed, so I cast those objects to np.array in the backend.

return result


Expand Down Expand Up @@ -1443,6 +1470,9 @@ def __array__(self, dtype: DTypeLike = None) -> np.ndarray:
array = array.astype("object")
return np.asarray(array.values, dtype=dtype)

def get_duck_array(self) -> np.ndarray:
return np.asarray(self)

@property
def shape(self) -> tuple[int, ...]:
return (len(self.array),)
Expand Down Expand Up @@ -1583,9 +1613,9 @@ def _repr_inline_(self, max_width: int) -> str:
return format_array_flat(self._get_array_subset(), max_width)

def _repr_html_(self) -> str:
from xarray.core.formatting import short_numpy_repr
from xarray.core.formatting import short_array_repr

array_repr = short_numpy_repr(self._get_array_subset())
array_repr = short_array_repr(self._get_array_subset())
return f"<pre>{escape(array_repr)}</pre>"

def copy(self, deep: bool = True) -> PandasMultiIndexingAdapter:
Expand Down
4 changes: 4 additions & 0 deletions xarray/core/variable.py
Original file line number Diff line number Diff line change
Expand Up @@ -430,6 +430,8 @@ def data(self) -> Any:
"""
if is_duck_array(self._data):
return self._data
elif isinstance(self._data, indexing.ExplicitlyIndexed):
return self._data.get_duck_array()
else:
return self.values

Expand Down Expand Up @@ -537,6 +539,8 @@ def load(self, **kwargs):
"""
if is_duck_dask_array(self._data):
self._data = as_compatible_data(self._data.compute(**kwargs))
elif isinstance(self._data, indexing.ExplicitlyIndexed):
self._data = self._data.get_duck_array()
elif not is_duck_array(self._data):
self._data = np.asarray(self._data)
return self
Expand Down
25 changes: 25 additions & 0 deletions xarray/tests/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -141,12 +141,37 @@ class UnexpectedDataAccess(Exception):


class InaccessibleArray(utils.NDArrayMixin, ExplicitlyIndexed):
"""Disallows any loading."""

def __init__(self, array):
self.array = array

def get_duck_array(self):
raise UnexpectedDataAccess("Tried accessing data")

def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
raise UnexpectedDataAccess("Tried accessing data")

def __getitem__(self, key):
raise UnexpectedDataAccess("Tried accessing data")


class DuckArrayWrapper(utils.NDArrayMixin):
"""Array-like that prevents casting to array.
Modeled after cupy."""

def __init__(self, array: np.ndarray):
self.array = array

def __getitem__(self, key):
return type(self)(self.array[key])

def __array__(self, dtype=None):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
raise UnexpectedDataAccess("Tried accessing data")

def __array_namespace__(self):
"""Present to satisfy is_duck_array test."""


class ReturnItem:
def __getitem__(self, key):
Expand Down
69 changes: 69 additions & 0 deletions xarray/tests/test_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@
from xarray.core.pycompat import array_type, integer_types
from xarray.core.utils import is_scalar
from xarray.tests import (
DuckArrayWrapper,
InaccessibleArray,
UnexpectedDataAccess,
assert_allclose,
Expand Down Expand Up @@ -201,6 +202,10 @@ def create_test_stacked_array() -> tuple[DataArray, DataArray]:


class InaccessibleVariableDataStore(backends.InMemoryDataStore):
"""
Store that does not allow any data access.
"""

def __init__(self):
super().__init__()
self._indexvars = set()
Expand All @@ -221,6 +226,47 @@ def lazy_inaccessible(k, v):
return {k: lazy_inaccessible(k, v) for k, v in self._variables.items()}


class DuckBackendArrayWrapper(backends.common.BackendArray):
"""Mimic a BackendArray wrapper around DuckArrayWrapper"""

def __init__(self, array):
self.array = DuckArrayWrapper(array)
self.shape = array.shape
self.dtype = array.dtype

def get_array(self):
return self.array

def __getitem__(self, key):
return self.array[key.tuple]


class AccessibleAsDuckArrayDataStore(backends.InMemoryDataStore):
"""
Store that returns a duck array, not convertible to numpy array,
on read. Modeled after nVIDIA's kvikio.
"""

def __init__(self):
super().__init__()
self._indexvars = set()

def store(self, variables, *args, **kwargs) -> None:
super().store(variables, *args, **kwargs)
for k, v in variables.items():
if isinstance(v, IndexVariable):
self._indexvars.add(k)

def get_variables(self):
def lazy_accessible(k, v):
dcherian marked this conversation as resolved.
Show resolved Hide resolved
if k in self._indexvars:
return v
data = indexing.LazilyIndexedArray(DuckBackendArrayWrapper(v.values))
return Variable(v.dims, data, v.attrs)

return {k: lazy_accessible(k, v) for k, v in self._variables.items()}


class TestDataset:
def test_repr(self) -> None:
data = create_test_data(seed=123)
Expand Down Expand Up @@ -4684,6 +4730,29 @@ def test_lazy_load(self) -> None:
ds.isel(time=10)
ds.isel(time=slice(10), dim1=[0]).isel(dim1=0, dim2=-1)

def test_lazy_load_duck_array(self) -> None:
store = AccessibleAsDuckArrayDataStore()
create_test_data().dump_to_store(store)

for decode_cf in [True, False]:
ds = open_dataset(store, decode_cf=decode_cf)
with pytest.raises(UnexpectedDataAccess):
ds["var1"].values

# these should not raise UnexpectedDataAccess:
ds.var1.data
ds.isel(time=10)
ds.isel(time=slice(10), dim1=[0]).isel(dim1=0, dim2=-1)
repr(ds)

# preserve the duck array type and don't cast to array
assert isinstance(ds["var1"].load().data, DuckArrayWrapper)
assert isinstance(
ds["var1"].isel(dim2=0, dim1=0).load().data, DuckArrayWrapper
)

ds.close()

def test_dropna(self) -> None:
x = np.random.randn(4, 4)
x[::2, 0] = np.nan
Expand Down
Loading