Fancy indexing a Dataset with dask DataArray causes excessive memory usage #5054
Comments
AFAICT indexing with a dask array is mostly, if not always :P, broken (e.g. #2511) and this does look like a duplicate. I'll move your comment there.

Thanks @dcherian.
FWIW my use case actually only needs indexing a single dimension, i.e., something equivalent to the numpy (or dask.array) `compress` function. This can be hacked for xarray datasets in a fairly straightforward way:

```python
import numpy as np
import xarray as xr


def _compress_dataarray(a, indexer, dim):
    data = a.data
    try:
        axis = a.dims.index(dim)
    except ValueError:
        v = data
    else:
        # rely on __array_function__ to handle dispatching to dask if
        # data is a dask array
        v = np.compress(indexer, a.data, axis=axis)
        if hasattr(v, 'compute_chunk_sizes'):
            # needed to know dim lengths
            v.compute_chunk_sizes()
    return v


def compress_dataset(ds, indexer, dim):
    if isinstance(indexer, str):
        indexer = ds[indexer].data
    coords = dict()
    for k in ds.coords:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        coords[k] = (a.dims, v)
    data_vars = dict()
    for k in ds.data_vars:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        data_vars[k] = (a.dims, v)
    attrs = ds.attrs.copy()
    return xr.Dataset(data_vars=data_vars, coords=coords, attrs=attrs)
```

Given the complexity of fancy indexing in general, I wonder if it's worth contemplating implementing a …
I have a dataset comprising several variables. All variables are dask arrays (e.g., backed by zarr). I would like to use one of these variables, which is a 1d boolean array, to index the other variables along a single large dimension. The boolean indexing array is ~40 million items long, with ~20 million true values.

If I do this all via dask (i.e., not using xarray) then I can index one dask array with another dask array via fancy indexing. The indexing array is not loaded into memory or computed. If I need to know the shape and chunks of the resulting arrays I can call `compute_chunk_sizes()`, but still very little memory is required.

If I do this via `xarray.Dataset.isel()` then a substantial amount of memory (several GB) is allocated during `isel()` and retained. This is problematic because in a real-world use case there are many arrays to be indexed, and memory runs out on standard systems.

There is a follow-on issue: if I then want to run a computation over one of the indexed arrays, and the indexing was done via xarray, that leads to a further blow-up of multiple GB of memory usage when using a dask distributed cluster.
I think the underlying issue here is that the indexing array is loaded into memory, and then gets copied multiple times when the dask graph is constructed. If using a distributed scheduler, further copies get made during scheduling of any subsequent computation.
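For illustration, here is a minimal sketch of the two code paths being compared (dimension and variable names are made up and sizes are reduced; the real case uses ~40 million elements backed by zarr):

```python
import dask.array as da
import xarray as xr

n = 1_000_000  # stand-in for the ~40 million element dimension
values = da.random.random(n, chunks=100_000)
mask = da.random.random(n, chunks=100_000) > 0.5  # 1-d boolean dask array

# Pure dask: boolean fancy indexing stays lazy; only compute_chunk_sizes()
# evaluates the mask, and memory usage stays small.
selected = values[mask]
selected.compute_chunk_sizes()

# Via xarray: the equivalent selection through Dataset.isel() is where the
# several-GB memory growth is observed.
ds = xr.Dataset({"values": ("variants", values), "mask": ("variants", mask)})
subset = ds.isel(variants=ds["mask"])
```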
I made a notebook which illustrates the increased memory usage during `Dataset.isel()` here:

https://colab.research.google.com/drive/1bn7Sj0An7TehwltWizU8j_l2OvPeoJyo?usp=sharing
This is possibly the same underlying issue (and use case) as raised by @eric-czech in #4663, so feel free to close this if you think it's a duplicate.