Slow performance of isel #2227
Here's how I would recommend writing the query, using label-based selection:

```python
%timeit ds.a.sel(time=slice(50_001, None))
117 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
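A minimal sketch of the comparison, with the sizes from the thread (tens of millions of points) scaled down so it runs quickly; the dataset construction here is an assumption, since the original setup code was not included:

```python
import numpy as np
import xarray as xr

# Reduced-size stand-in for the dataset in the issue (assumed layout).
n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})

# Label-based selection with a slice: no boolean mask, no alignment.
subset_sel = ds.a.sel(time=slice(50_001, None))

# The boolean-mask approach from the issue selects the same values.
time_filter = ds.time > 50_000
subset_isel = ds.a.isel(time=time_filter)

assert subset_sel.equals(subset_isel)
```

Both paths return identical results; the slice version just avoids building and aligning a mask over the whole axis.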
@rabernat that's a good solution where it's a slice. When is it that it needs to align a bool array? If you try to pass an array of unequal length, it doesn't work anyway:

```python
In [12]: ds.a.isel(time=time_filter[:-1])
IndexError: Boolean array size 54999999 is used to index array with shape (55000000,).
```
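The same length constraint can be reproduced at small scale with plain numpy, which is what ultimately raises the error:

```python
import numpy as np

a = np.arange(10)
mask = a > 4          # length-10 mask: matches the axis, works fine
ok = a[mask]

try:
    a[mask[:-1]]      # length-9 mask against a length-10 axis
    raised = False
except IndexError:
    raised = True

assert raised and ok.size == 5
```

So a boolean indexer must already match the axis length, which is part of why a separate alignment step for the equal-length case seems redundant.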
I am sorry @rabernat and @maxim-lian ,
Another part of the matrix of possibilities: it takes about half the time if you pass a plain numpy array instead of a DataArray:

```python
%timeit ds.a.isel(time=time_filter.values)
1.3 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
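A small sketch of the two indexer flavors (sizes reduced; setup assumed as before): a DataArray indexer makes xarray align its index against the dataset first, while a raw ndarray skips that step and yields the same result:

```python
import numpy as np
import xarray as xr

n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
time_filter = ds.time > 50_000

with_alignment = ds.a.isel(time=time_filter)             # DataArray indexer
without_alignment = ds.a.isel(time=time_filter.values)   # ndarray indexer

# Same values, same coordinates; only the alignment overhead differs.
assert with_alignment.equals(without_alignment)
```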
My measurements:
Given the size of this gap, I suspect this could be improved with some investigation and profiling, but there is certainly an upper limit on the possible performance gain. One simple example: indexing the dataset needs to index both the data variables and the coordinates.
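To illustrate that last point with a small sketch (dataset layout assumed): indexing a Dataset applies the indexer to every variable along that dimension, including the coordinate itself, so there are at least two underlying indexing operations per call:

```python
import numpy as np
import xarray as xr

n = 1_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})

sub = ds.isel(time=slice(100))

# Both the data variable and the time coordinate were indexed.
assert sub["a"].sizes["time"] == 100
assert sub["time"].size == 100
```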
I am looking into a similar performance issue with isel, but it seems the issue is that it creates arrays that are much bigger than needed. For my multidimensional case (time/x/y/window), what should only take a few hundred MB spikes up to tens of GB of used RAM. I don't know if this might be a possible source of performance issues.
@WeatherGod do you have a reproducible example? I'm happy to have a look.
Huh, strange... I just tried a simplified version of what I was doing (in particular, no dask arrays), and everything worked fine. I'll have to investigate further.
Just for posterity, though, here is my simplified (working!) example:
Yeah, it looks like if
@WeatherGod does adding something like chunking make a difference?
No, it does not make a difference. The example above peaks at around 5GB of memory (a bit much, but manageable). And it peaks similarly if we chunk it as you suggested.
@WeatherGod - are you reading data from netCDF files, by chance? If so, can you share the compression/chunk layout for those files?
It would be ten files opened via xr.open_mfdataset(), concatenated across a time dimension, each one looking like:
In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree.
On master I'm seeing
Can someone else reproduce?
Yes, I'm seeing similar numbers: about 10x slower indexing on a DataArray. This seems to have gotten slower over time. It would be good to track this down and add a benchmark!
#3319 gives us about a 2x performance boost. It could likely be much faster, but at least this fixes the regression.
Before #3319:
After #3319:
Good job!
Can we short-circuit the special case where the index of the array used for slicing is the same object as the index being sliced, so no alignment is needed?

```python
>>> time_filter.time._variable is ds.time._variable
True
>>> %timeit xr.align(time_filter, ds.a)
477 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The time spent on that align call could be zero!
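A sketch of the proposed short-circuit (this is a hypothetical helper pattern, not xarray API): since binary ops propagate coordinate variables by reference, an identity check on the index variable can tell us when alignment is a no-op:

```python
import numpy as np
import xarray as xr

n = 1_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
time_filter = ds.time > 500

# The mask's time index is literally the same variable object as ds's.
same_index = time_filter["time"].variable is ds["time"].variable

if same_index:
    aligned = (time_filter, ds.a)          # nothing to do: skip align
else:
    aligned = xr.align(time_filter, ds.a)  # general case

assert same_index
```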
I think align tries to optimize that case, so maybe something's also possible there?
Yes, align checks for identical indexes. The real mystery here is why it is still this slow.
* Propagate indexes in DataArray binary operations. Works by propagating indexes in DataArray._replace. xref #2227. Tests pass!
* Remove commented code.
* Fix roll.
Hi, I'd like to understand how
I don't know much about indexing, but that PR propagates a "new" indexes property as part of #1603 (work towards enabling more flexible indexing); it doesn't change anything about indexing itself. I think the dask docs may be more relevant to what you're asking: https://docs.dask.org/en/latest/array-slicing.html
I just changed
to:
And that changed the runtime of my code from unknown (still running after 3 hours) to around 10 seconds.
@dschwoerer are you sure that you are actually calculating the same thing in both cases? What exactly do the values of
I see, they are not the same - the slow one is still a dask array, the other one is not:
Otherwise they are the same, so this might be dask-related ...
A reproducible example would help, but indexing with dask arrays is a bit limited. With #5873 it's possible it will raise an error and ask you to compute the indexer. Also see dask/dask#4156. EDIT: your slowdown is probably because it's computing the dask indexer.
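A small sketch of that workaround, assuming dask is installed (sizes and chunking here are illustrative, not from the report): compute the lazy boolean mask once, then index with the resulting numpy-backed array instead of the dask one:

```python
import numpy as np
import xarray as xr

n = 10_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)}).chunk({"time": 1_000})

mask = ds["a"] > 0.5        # lazy, dask-backed boolean mask
indexer = mask.compute()    # materialize it once, up front
subset = ds["a"].isel(time=indexer)

assert subset.sizes["time"] == int(indexer.sum())
```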
Hi,
I get very slow performance from Dataset.isel or DataArray.isel in comparison with the native numpy approach. Do you know where this comes from?
Select some values with DataArray.isel:
2.22 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use the native numpy approach:
163 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
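The timed snippets themselves were not included in the report; a reduced reconstruction of the two approaches (names and sizes are assumptions) looks like this:

```python
import numpy as np
import xarray as xr

n = 100_000
ds = xr.Dataset({"a": ("time", np.random.rand(n))},
                coords={"time": np.arange(n)})
mask = ds["time"] > 50_000

# DataArray.isel path (slower: alignment plus coordinate indexing)
xr_result = ds["a"].isel(time=mask)

# "Native numpy" path: index the raw values directly
np_result = ds["a"].values[mask.values]

assert np.array_equal(xr_result.values, np_result)
```

The two return the same values; the gap comes from xarray's extra bookkeeping per call.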
xarray: 0.10.4
pandas: 0.23.0
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 39.1.0
pip: 9.0.3
conda: None
pytest: 3.5.1
IPython: 6.4.0
sphinx: 1.7.4