Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

Closed
benbovy opened this issue Oct 30, 2020 · 6 comments
Closed

Comments

@benbovy
Copy link
Member

benbovy commented Oct 30, 2020

What happened:

Applying .isel() on a DataArray or Dataset with chunked data using 1-d indices (either stored in a xarray.Variable or a numpy.ndarray) gives weird chunks (i.e., a lot of chunks with small sizes).

What you expected to happen:

More consistent chunk sizes.

Minimal Complete Verifiable Example:

Let's create a chunked DataArray

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)

In [4]: da
Out[4]:
<xarray.DataArray (points: 100)>
dask.array<xarray-<this-array>, shape=(100,), dtype=float64, chunksize=(50,), chunktype=numpy.ndarray>
Dimensions without coordinates: points

Select random indices results in a lot of small chunks

In [5]: indices = xr.Variable('nodes', np.random.choice(np.arange(100, dtype='int'), size=10))

In [6]: da_sel = da.isel(points=indices)

In [7]: da_sel.chunks
Out[7]: ((1, 1, 3, 1, 1, 3),)

What I would expect

In [8]: da.data.vindex[indices.data].chunks
Out[8]: ((10,),)

This works fine with 2+ dimensional indexers, e.g.,

In [9]: indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(10, 10)))

In [10]: da_sel_2d = da.isel(points=indices_2d)

In [11]: da_sel_2d.chunks
Out[11]: ((10,), (10,))

Anything else we need to know?:

I suspect the issue is here:

# If all key is 1-dimensional and there are no duplicate labels,
# key can be mapped as an OuterIndexer.

In the example above I think we still want vectorized indexing (i.e., call dask.array.Array.vindex[] instead of dask.array.Array[]).

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:21:09)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: 2.25.0
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: None
pytest: 5.4.3
IPython: 7.16.1
sphinx: 3.2.1

@dcherian
Copy link
Contributor

dcherian commented Mar 2, 2021

dask/dask#3648 seems related.

@phofl
Copy link
Contributor

phofl commented Aug 19, 2024

The 1D case works now, it uses the new take implementation that takes care of the chunks.

One thing that bothers me though is the 2D indexing, that just returns a single chunk and doesn't support Dask Arrays.

@dcherian 2 questions:

  • Is the 2D indexing example here a common use-case?
  • Would it be helpful if Dask-arrays as indexers are supported? (not sure how to get there tbh, just thinking about this)

@dcherian
Copy link
Contributor

Thanks @phofl

re: the nD case we sequentially apply each indexer here:

xarray/xarray/core/indexing.py

Lines 1618 to 1627 in ca2e9d6

def _oindex_get(self, indexer: OuterIndexer):
key = indexer.tuple
try:
return self.array[key]
except NotImplementedError:
# manual orthogonal indexing
value = self.array
for axis, subkey in reversed(list(enumerate(key))):
value = value[(slice(None),) * axis + (subkey,)]
return value

dask arrays as indexers does not help Xarray today since we need to construct output coordinates from the indexer, so we'll just compute it. I think there are definitely some uses for it. index using the output of dask.array.Array.argmax for example. I think there are open issues on dask/dask about it.

@phofl
Copy link
Contributor

phofl commented Aug 19, 2024

Isn't this one using vindex instead of oindex?

I have a PR here dask/dask#11330 that fixes this I think, I noticed that the vindex path seems to be more common than I expected.

@phofl
Copy link
Contributor

phofl commented Aug 19, 2024

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)
indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(100, 100)))
da_sel_2d = da.isel(points=indices_2d)

Before:

Screenshot 2024-08-19 at 21 01 36

After:

Screenshot 2024-08-19 at 21 00 52

@dcherian
Copy link
Contributor

Very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants