Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

benbovy · 2020-10-30T10:55:33Z

What happened:

Applying .isel() on a DataArray or Dataset with chunked data using 1-d indices (either stored in a xarray.Variable or a numpy.ndarray) gives weird chunks (i.e., a lot of chunks with small sizes).

What you expected to happen:

More consistent chunk sizes.

Minimal Complete Verifiable Example:

Let's create a chunked DataArray

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)

In [4]: da
Out[4]:
<xarray.DataArray (points: 100)>
dask.array<xarray-<this-array>, shape=(100,), dtype=float64, chunksize=(50,), chunktype=numpy.ndarray>
Dimensions without coordinates: points

Select random indices results in a lot of small chunks

In [5]: indices = xr.Variable('nodes', np.random.choice(np.arange(100, dtype='int'), size=10))

In [6]: da_sel = da.isel(points=indices)

In [7]: da_sel.chunks
Out[7]: ((1, 1, 3, 1, 1, 3),)

What I would expect

In [8]: da.data.vindex[indices.data].chunks
Out[8]: ((10,),)

This works fine with 2+ dimensional indexers, e.g.,

In [9]: indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(10, 10)))

In [10]: da_sel_2d = da.isel(points=indices_2d)

In [11]: da_sel_2d.chunks
Out[11]: ((10,), (10,))

Anything else we need to know?:

I suspect the issue is here:

xarray/xarray/core/variable.py

Lines 616 to 617 in 063606b

    
           # If all key is 1-dimensional and there are no duplicate labels, 
        
           # key can be mapped as an OuterIndexer.

In the example above I think we still want vectorized indexing (i.e., call dask.array.Array.vindex[] instead of dask.array.Array[]).

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:21:09)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: 2.25.0
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: None
pytest: 5.4.3
IPython: 7.16.1
sphinx: 3.2.1

The text was updated successfully, but these errors were encountered:

dcherian · 2021-03-02T17:36:48Z

dask/dask#3648 seems related.

phofl · 2024-08-19T13:50:49Z

The 1D case works now, it uses the new take implementation that takes care of the chunks.

One thing that bothers me though is the 2D indexing, that just returns a single chunk and doesn't support Dask Arrays.

@dcherian 2 questions:

Is the 2D indexing example here a common use-case?
Would it be helpful if Dask-arrays as indexers are supported? (not sure how to get there tbh, just thinking about this)

dcherian · 2024-08-19T15:38:37Z

Thanks @phofl

re: the nD case we sequentially apply each indexer here:

xarray/xarray/core/indexing.py

Lines 1618 to 1627 in ca2e9d6

    
           def _oindex_get(self, indexer: OuterIndexer): 
        
               key = indexer.tuple 
        
               try: 
        
                   return self.array[key] 
        
               except NotImplementedError: 
        
                   # manual orthogonal indexing 
        
                   value = self.array 
        
                   for axis, subkey in reversed(list(enumerate(key))): 
        
                       value = value[(slice(None),) * axis + (subkey,)] 
        
                   return value

dask arrays as indexers does not help Xarray today since we need to construct output coordinates from the indexer, so we'll just compute it. I think there are definitely some uses for it. index using the output of dask.array.Array.argmax for example. I think there are open issues on dask/dask about it.

phofl · 2024-08-19T19:00:40Z

Isn't this one using vindex instead of oindex?

I have a PR here dask/dask#11330 that fixes this I think, I noticed that the vindex path seems to be more common than I expected.

phofl · 2024-08-19T19:01:55Z

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(100), dims='points').chunk(50)
indices_2d = xr.Variable(('x', 'y'), np.random.choice(np.arange(100), size=(100, 100)))
da_sel_2d = da.isel(points=indices_2d)

Before:

After:

dcherian · 2024-08-19T19:02:52Z

Very nice!

benbovy mentioned this issue Oct 30, 2020

Refactor accessor + flexible indexes + Dask support xarray-contrib/xoak#18

Merged

3 tasks

dcherian added the topic-indexing label Nov 4, 2020

dcherian added the topic-dask label Jan 17, 2021

dcherian mentioned this issue Jul 22, 2024

Groupby-map is slow with out of order indices #9220

Open

dcherian added the upstream issue label Aug 21, 2024

dcherian closed this as completed Aug 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

benbovy commented Oct 30, 2020 •

edited

Loading

INSTALLED VERSIONS

dcherian commented Mar 2, 2021

phofl commented Aug 19, 2024

dcherian commented Aug 19, 2024

phofl commented Aug 19, 2024

phofl commented Aug 19, 2024 •

edited

Loading

dcherian commented Aug 19, 2024

Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

Vectorized indexing (isel) of chunked data with 1D indices gives weird chunks #4555

Comments

benbovy commented Oct 30, 2020 • edited Loading

INSTALLED VERSIONS

dcherian commented Mar 2, 2021

phofl commented Aug 19, 2024

dcherian commented Aug 19, 2024

phofl commented Aug 19, 2024

phofl commented Aug 19, 2024 • edited Loading

dcherian commented Aug 19, 2024

benbovy commented Oct 30, 2020 •

edited

Loading

phofl commented Aug 19, 2024 •

edited

Loading