Slow performance of isel #2227
Here's how I would recommend writing the query, using label-based selection:

```python
%timeit ds.a.sel(time=slice(50_001, None))
117 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
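A minimal sketch of the comparison, with the sizes from the thread (tens of millions of points) scaled down so it runs quickly; the dataset construction here is an assumption, since the original setup code was not included:

```python
import numpy as np
import xarray as xr

# Reduced-size stand-in for the dataset in the issue (assumed layout).
n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})

# Label-based selection with a slice: no boolean mask, no alignment.
subset_sel = ds.a.sel(time=slice(50_001, None))

# The boolean-mask approach from the issue selects the same values.
time_filter = ds.time > 50_000
subset_isel = ds.a.isel(time=time_filter)

assert subset_sel.equals(subset_isel)
```

Both paths return identical results; the slice version just avoids building and aligning a mask over the whole axis.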
@rabernat that's a good solution where it's a slice. When is it that it needs to align a bool array? If you try to pass an array of unequal length, it doesn't work anyway:

```python
In [12]: ds.a.isel(time=time_filter[:-1])
IndexError: Boolean array size 54999999 is used to index array with shape (55000000,).
```
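The same length constraint can be reproduced at small scale with plain numpy, which is what ultimately raises the error:

```python
import numpy as np

a = np.arange(10)
mask = a > 4          # length-10 mask: matches the axis, works fine
ok = a[mask]

try:
    a[mask[:-1]]      # length-9 mask against a length-10 axis
    raised = False
except IndexError:
    raised = True

assert raised and ok.size == 5
```

So a boolean indexer must already match the axis length, which is part of why a separate alignment step for the equal-length case seems redundant.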
I am sorry @rabernat and @maxim-lian ,
Another part of the matrix of possibilities: it takes about half the time if you pass a plain numpy array instead of a DataArray:

```python
%timeit ds.a.isel(time=time_filter.values)
1.3 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
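A small sketch of the two indexer flavors (sizes reduced; setup assumed as before): a DataArray indexer makes xarray align its index against the dataset first, while a raw ndarray skips that step and yields the same result:

```python
import numpy as np
import xarray as xr

n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
time_filter = ds.time > 50_000

with_alignment = ds.a.isel(time=time_filter)             # DataArray indexer
without_alignment = ds.a.isel(time=time_filter.values)   # ndarray indexer

# Same values, same coordinates; only the alignment overhead differs.
assert with_alignment.equals(without_alignment)
```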
My measurements:
Given the size of this gap, I suspect this could be improved with some investigation and profiling, but there is certainly an upper limit on the possible performance gain. One simple example: indexing the dataset needs to index both the data variables and the coordinates.
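To illustrate that last point with a small sketch (dataset layout assumed): indexing a Dataset applies the indexer to every variable along that dimension, including the coordinate itself, so there are at least two underlying indexing operations per call:

```python
import numpy as np
import xarray as xr

n = 1_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})

sub = ds.isel(time=slice(100))

# Both the data variable and the time coordinate were indexed.
assert sub["a"].sizes["time"] == 100
assert sub["time"].size == 100
```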
I am looking into a similar performance issue with isel, but it seems the issue is that it creates arrays that are much bigger than needed. For my multidimensional case (time/x/y/window), what should only take a few hundred MB spikes up to tens of GB of used RAM. I don't know if this might be a possible source of performance issues.
@WeatherGod do you have a reproducible example? I'm happy to have a look.
Huh, strange... I just tried a simplified version of what I was doing (in particular, no dask arrays), and everything worked fine. I'll have to investigate further.
Just for posterity, though, here is my simplified (working!) example:
Yeah, it looks like if
@WeatherGod does adding something like chunking make a difference?
No, it does not make a difference. The example above peaks at around 5GB of memory (a bit much, but manageable). And it peaks similarly if we chunk it as you suggested.
@WeatherGod - are you reading data from netCDF files, by chance? If so, can you share the compression/chunk layout for those files?
It would be ten files opened via xr.open_mfdataset(), concatenated across a time dimension, each one looking like:
In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree.
On master I'm seeing
Can someone else reproduce?
Yes, I'm seeing similar numbers: about 10x slower indexing on a DataArray. This seems to have gotten slower over time. It would be good to track this down and add a benchmark!
#3319 gives us about a 2x performance boost. It could likely be much faster, but at least this fixes the regression.
Before #3319:
After #3319:
Good job!
Can we short-circuit the special case where the index of the array used for slicing is the same object as the index being sliced, so no alignment is needed?

```python
>>> time_filter.time._variable is ds.time._variable
True
>>> %timeit xr.align(time_filter, ds.a)
477 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The time spent on that align call could be zero!
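A sketch of the proposed short-circuit (this is a hypothetical helper pattern, not xarray API): since binary ops propagate coordinate variables by reference, an identity check on the index variable can tell us when alignment is a no-op:

```python
import numpy as np
import xarray as xr

n = 1_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
time_filter = ds.time > 500

# The mask's time index is literally the same variable object as ds's.
same_index = time_filter["time"].variable is ds["time"].variable

if same_index:
    aligned = (time_filter, ds.a)          # nothing to do: skip align
else:
    aligned = xr.align(time_filter, ds.a)  # general case

assert same_index
```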
I think align tries to optimize that case, so maybe something's also possible there?
Yes, align checks for identical indexes. The real mystery here is why it is still this slow.
* Propagate indexes in DataArray binary operations. Works by propagating indexes in DataArray._replace. xref #2227. Tests pass!
* Remove commented code.
* Fix roll.
Hi, I'd like to understand how
I don't know much about indexing, but that PR propagates a "new" indexes property as part of #1603 (work towards enabling more flexible indexing); it doesn't change anything about indexing itself. I think the dask docs may be more relevant to what you're asking: https://docs.dask.org/en/latest/array-slicing.html
I just changed
to:
And that changed the runtime of my code from unknown (still running after 3 hours) to around 10 seconds.
@dschwoerer are you sure that you are actually calculating the same thing in both cases? What exactly do the values of
I see, they are not the same - the slow one is still a dask array, the other one is not:
Otherwise they are the same, so this might be dask-related ...
A reproducible example would help, but indexing with dask arrays is a bit limited. With #5873 it's possible it will raise an error and ask you to compute the indexer. Also see dask/dask#4156. EDIT: your slowdown is probably because it's computing the dask indexer.
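A small sketch of that workaround, assuming dask is installed (sizes and chunking here are illustrative, not from the report): compute the lazy boolean mask once, then index with the resulting numpy-backed array instead of the dask one:

```python
import numpy as np
import xarray as xr

n = 10_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)}).chunk({"time": 1_000})

mask = ds["a"] > 0.5        # lazy, dask-backed boolean mask
indexer = mask.compute()    # materialize it once, up front
subset = ds["a"].isel(time=indexer)

assert subset.sizes["time"] == int(indexer.sum())
```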
Hi,
I get very slow performance from Dataset.isel or DataArray.isel in comparison with the native numpy approach. Do you know where this comes from?
Select some values with DataArray.isel:
2.22 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use the native numpy approach:
163 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
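The timed snippets themselves were not included in the report; a reduced reconstruction of the two approaches (names and sizes are assumptions) looks like this:

```python
import numpy as np
import xarray as xr

n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
mask = ds["time"] > 50_000

# DataArray.isel path (slower: alignment plus coordinate indexing)
xr_result = ds["a"].isel(time=mask)

# "Native numpy" path: index the raw values directly
np_result = ds["a"].values[mask.values]

assert np.array_equal(xr_result.values, np_result)
```

The two return the same values; the gap comes from xarray's extra bookkeeping per call.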
xarray: 0.10.4
pandas: 0.23.0
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 39.1.0
pip: 9.0.3
conda: None
pytest: 3.5.1
IPython: 6.4.0
sphinx: 1.7.4