Refactor accessor + flexible indexes + Dask support #18
Conversation
Also add sklearn BallTree wrappers.
- Removed "tolerance"
- Removed "transform"
- Plugin IndexWrapper
- Enable dask-based index builds
I think it's ready for review. @willirath @koldunovn if you want to take a look (I know, it's quite big). I don't know why the lint tests are failing here (I ran black on those files). Tests are running fine locally. We still need some more tests for the accessor; that can be done in #10 after merging this (probably a couple of merge conflicts to solve). I updated the example notebook. There's one performance issue when index coordinates are chunked (I think it's an xarray issue), but we can go ahead and address this later IMO.
Opened pydata/xarray#4555
Let's also track pydata/xarray#2511 so that eventually we won't have to trigger computation of the dask graph in `Dataset.xoak.sel`.
Looks great. I especially love how easy it is to add any indexer: https://github.com/ESM-VFC/xoak/pull/18/files#diff-436fac86313153a2042128086fef47f4996ea933c8d3697ffdbdbd68d50d3146
I'll give it a test with some bigger Dask-based data and report back.
I've done some tests on a bigger server with the S2 point index, still using randomly located points. Settings:
Results:
Caveats: the results above were obtained after I spent a while tweaking chunks, the dask cluster, and the dask configuration (e.g., disabling work stealing). It turns out to be very tricky in reality. Scaling is also highly limited by memory.

There's one major issue here: during the query, the indexes get replicated on a growing number of dask workers, which causes memory blow-up and a significant drop in performance due to index data serialization (which, in the case of S2, means rebuilding the indexes from scratch!). I suspect that dask's scheduler has a poor idea of the memory footprint of those indexes (persisted in memory). This may explain why dask's heuristics for assigning tasks to workers fail miserably in this case. I need to figure out:
Turns out we just need to define the
It's pretty straightforward to do this using

Despite those tricks, I still sometimes get workers restarted or indexes deleted for unknown reasons. I don't know, maybe I use too large amounts of data for the cluster I set up. With this approach, I'm afraid there's no way to completely avoid large amounts of data being transferred between workers, so scalability is still limited. On the positive side, we do really get very nice speed-ups with decent amounts of data (millions of points on my laptop)!
Broadcasting the query point chunks to all workers may greatly help too:

```python
query_points = ...  # dask array
query_points = query_points.persist()
client.replicate(query_points)
```

This could be the best thing to do actually, since in most cases the whole query data should easily fit in each worker's memory. Now this is looking good:
OK, let's merge this. Tests failed because of black (I suspect a version mismatch), so no harm here; we can take care of this later.
This PR will eventually be quite big (sorry, at this stage I think it'll be more productive overall than splitting things in many PRs).
Two goals:

- open `xoak` to other indexes, reusing the same API for indexing Datasets / DataArrays

TODO:
Flexible indexes
Registering a new custom index in `xoak` can be easily done with the help of a small adapter class that must implement the `build` and `query` methods (see the sketch below). Any option to pass to the underlying index construction should be added as an argument to the adapter class' `__init__` method (we could address later how query options are handled).
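Below is a minimal sketch of such an adapter, wrapping a scikit-learn BallTree. The registration decorator name (`xoak.register_index`) and the exact `build`/`query` signatures are assumptions made here for illustration; only the general shape (an adapter class with `build` and `query`, construction options in `__init__`) is taken from this PR.

```python
import numpy as np
from sklearn.neighbors import BallTree

import xoak


@xoak.register_index("my_index")  # hypothetical decorator name
class MyIndex:
    """Illustrative adapter wrapping sklearn's BallTree."""

    def __init__(self, leaf_size=40, metric="euclidean"):
        # options for the underlying index construction go here
        self.leaf_size = leaf_size
        self.metric = metric

    def build(self, points):
        # points: (n_points, n_coords) array of index coordinates
        self._tree = BallTree(points, leaf_size=self.leaf_size, metric=self.metric)

    def query(self, points):
        # return (distances, positions) of the nearest index point
        # for each query point
        distances, positions = self._tree.query(points, k=1)
        return np.ravel(distances), np.ravel(positions)
```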
In the example above, `my_index` is registered in xoak's available indexes. It can be selected by its name, and it's also possible to directly provide the adapter class (useful when one does not want to register an index adapter); both forms are shown in the sketch below.
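A hypothetical usage sketch, assuming the adapter above has been registered as `"my_index"`. The coordinate names are made up, and passing index options as extra keyword arguments to `set_index` is an assumption rather than the PR's exact signature.

```python
import numpy as np
import xarray as xr

import xoak  # registers the .xoak accessor

# a dataset with 2-d latitude / longitude coordinates (random for illustration)
ds = xr.Dataset(
    coords={
        "lat": (("y", "x"), np.random.uniform(-90, 90, size=(100, 100))),
        "lon": (("y", "x"), np.random.uniform(-180, 180, size=(100, 100))),
    }
)

# select the registered index by its name...
ds.xoak.set_index(["lat", "lon"], "my_index", leaf_size=20)

# ...or provide the adapter class directly, without registering it
ds.xoak.set_index(["lat", "lon"], MyIndex, leaf_size=20)
```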
`xoak.indexes` returns a mapping of all available indexes. As an alternative to the decorator above, it can also be used to register index adapters, e.g., as sketched below.
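A sketch of both uses; the `register` method name is an assumption, not necessarily the method exposed by this PR.

```python
import xoak

# inspect the names of all available (registered) index adapters
print(list(xoak.indexes))

# register an adapter without using the decorator (hypothetical method name)
xoak.indexes.register("my_index", MyIndex)
```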
Dask support

This PR implements a dask-enabled index build and/or query that is independent of the underlying index. It handles chunked coordinate variables for either or both of the index / query points.
For chunked index coordinates, a forest of index trees is built (one tree per chunk).
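Conceptually, the build stage could be sketched with `dask.delayed` as below. This is an illustration of the idea, not the PR's actual implementation; `MyIndex` is the hypothetical adapter from the sketch above.

```python
import dask
import dask.array as da

# chunked index coordinates, stacked as a (n_points, n_coords) dask array
index_points = da.random.uniform(-1.0, 1.0, size=(1_000_000, 2), chunks=(250_000, 2))


@dask.delayed
def build_tree(chunk):
    # build one index tree per chunk of index points
    adapter = MyIndex()
    adapter.build(chunk)
    return adapter


# one delayed tree per chunk -> a "forest" of index trees
forest = [build_tree(chunk) for chunk in index_points.to_delayed().ravel()]

# optionally persist the built trees in (distributed) memory
forest = dask.persist(*forest)
```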
A query is executed in two stages:

1. the `query` method is run for each query array chunk and each tree of the forest.
2. `argmin` is applied on the distances array over the indexes dimension (columns).
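And a rough sketch of that two-stage query, following on from the forest built above (again an illustration only; in a real implementation the per-tree positions would also need to be offset to global index positions).

```python
import dask
import dask.array as da
import numpy as np


@dask.delayed
def query_tree(adapter, query_chunk):
    # stage 1: query one tree of the forest with one chunk of query points;
    # returns (distances, positions), positions local to that tree's chunk
    return adapter.query(query_chunk)


@dask.delayed
def select_nearest(distances_list, positions_list):
    # stage 2: stack per-tree results column-wise and apply argmin over
    # the trees (indexes) dimension to pick the overall nearest point
    distances = np.stack(distances_list, axis=1)
    positions = np.stack(positions_list, axis=1)
    best_tree = np.argmin(distances, axis=1)
    rows = np.arange(distances.shape[0])
    return positions[rows, best_tree]


query_points = da.random.uniform(-1.0, 1.0, size=(10_000, 2), chunks=(2_500, 2))

results = []
for query_chunk in query_points.to_delayed().ravel():
    per_tree = [query_tree(tree, query_chunk) for tree in forest]
    distances_list = [res[0] for res in per_tree]
    positions_list = [res[1] for res in per_tree]
    results.append(select_nearest(distances_list, positions_list))

nearest_positions = dask.compute(*results)
```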
Advantages of this approach: it is independent of the underlying index, and the forest of index trees can be persisted in distributed memory with `dask.persist()`. There's an option for that (enabled by default).

Potential caveats:
Other changes
I removed the `transform` argument in `Dataset.xoak.set_index` and the `tolerance` argument in `Dataset.xoak.sel`. Supported features may greatly vary from one index to another, so I'd suggest that most logic should be implemented in index adapter classes (with option forwarding), instead of trying to extend an API that is common to all indexes.

`Dataset.xoak.sel` triggers the computation of the query dask graph (if any), because `xarray.Dataset.sel` does not (yet) support dask arrays as indexers.

I think we should leave `Dataset.xoak.sel` with as few features as possible and on par with `xarray.Dataset.sel`. It would be great to leverage all the features of tree-based indexes (e.g., return distances, select k-nearest neighbors, lazy selection, set masks or max distances, etc.), but I think that the `xoak` accessor should provide other methods for those special cases.
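For reference, a minimal sketch of that bare `Dataset.xoak.sel` usage, mirroring `xarray.Dataset.sel`. The query coordinate names, dimensions, and chunking are illustrative assumptions, and it reuses the indexed dataset `ds` from the sketches above.

```python
import numpy as np
import xarray as xr

# query points (e.g., a ship track), possibly chunked with dask
query = xr.Dataset(
    {
        "lat": ("track", np.random.uniform(-90, 90, 10_000)),
        "lon": ("track", np.random.uniform(-180, 180, 10_000)),
    }
).chunk({"track": 2_500})

# nearest-neighbor selection; same call style as xarray.Dataset.sel, but
# this triggers computation of the query dask graph (see above)
ds_selected = ds.xoak.sel(lat=query.lat, lon=query.lon)
```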