Fancy indexing a Dataset with dask DataArray causes excessive memory usage #5054
Comments
AFAICT indexing with a dask array is mostly, if not always :P, broken (e.g. #2511) and this does look like a duplicate. I'll move your comment there.

Thanks @dcherian.
FWIW my use case actually only needs indexing a single dimension, i.e., something equivalent to the numpy (or dask.array) `compress` function. This can be hacked for xarray datasets in a fairly straightforward way:

```python
import numpy as np
import xarray as xr


def _compress_dataarray(a, indexer, dim):
    data = a.data
    try:
        axis = a.dims.index(dim)
    except ValueError:
        v = data
    else:
        # rely on __array_function__ to handle dispatching to dask if
        # data is a dask array
        v = np.compress(indexer, a.data, axis=axis)
        if hasattr(v, 'compute_chunk_sizes'):
            # needed to know dim lengths
            v.compute_chunk_sizes()
    return v


def compress_dataset(ds, indexer, dim):
    if isinstance(indexer, str):
        indexer = ds[indexer].data
    coords = dict()
    for k in ds.coords:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        coords[k] = (a.dims, v)
    data_vars = dict()
    for k in ds.data_vars:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        data_vars[k] = (a.dims, v)
    attrs = ds.attrs.copy()
    return xr.Dataset(data_vars=data_vars, coords=coords, attrs=attrs)
```

Given the complexity of fancy indexing in general, I wonder if it's worth contemplating implementing a …
I have a dataset comprising several variables. All variables are dask arrays (e.g., backed by zarr). I would like to use one of these variables, which is a 1d boolean array, to index the other variables along a single large dimension. The boolean indexing array is ~40 million items long, with ~20 million true values.

If I do this all via dask (i.e., not using xarray) then I can index one dask array with another dask array via fancy indexing. The indexing array is not loaded into memory or computed. If I need to know the shape and chunks of the resulting arrays I can call `compute_chunk_sizes()`, but still very little memory is required.

If I do this via `xarray.Dataset.isel()` then a substantial amount of memory (several GB) is allocated during `isel()` and retained. This is problematic because in a real-world use case there are many arrays to be indexed, and memory runs out on standard systems.

There is a follow-on issue: if I then want to run a computation over one of the indexed arrays, and the indexing was done via xarray, that leads to a further blow-up of multiple GB of memory usage when using a dask distributed cluster.
I think the underlying issue here is that the indexing array is loaded into memory, and then gets copied multiple times when the dask graph is constructed. If using a distributed scheduler, further copies get made during scheduling of any subsequent computation.
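For illustration, here is a minimal sketch of the two code paths being compared (dimension and variable names are made up and sizes are reduced; the real case uses ~40 million elements backed by zarr):

```python
import dask.array as da
import xarray as xr

n = 1_000_000  # stand-in for the ~40 million element dimension
values = da.random.random(n, chunks=100_000)
mask = da.random.random(n, chunks=100_000) > 0.5  # 1-d boolean dask array

# Pure dask: boolean fancy indexing stays lazy; only compute_chunk_sizes()
# evaluates the mask, and memory usage stays small.
selected = values[mask]
selected.compute_chunk_sizes()

# Via xarray: the equivalent selection through Dataset.isel() is where the
# several-GB memory growth is observed.
ds = xr.Dataset({"values": ("variants", values), "mask": ("variants", mask)})
subset = ds.isel(variants=ds["mask"])
```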
I made a notebook which illustrates the increased memory usage during `Dataset.isel()` here:

https://colab.research.google.com/drive/1bn7Sj0An7TehwltWizU8j_l2OvPeoJyo?usp=sharing
This is possibly the same underlying issue (and use case) as raised by @eric-czech in #4663, so feel free to close this if you think it's a duplicate.