
Fix memory explosion when auto-calculating canvas range extents with dask #717

Merged · 1 commit · Feb 23, 2019

Conversation

jonmmease (Collaborator)

This PR improves the Canvas autorange logic to avoid bringing each full x/y array into memory in order to compute the min/max values.
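A minimal sketch of the general idea (not the exact code in this PR): build lazy min/max expressions for the x/y columns and evaluate them with a single dask.compute call, so only four scalars ever come back to the client. The function and argument names here are placeholders for illustration.

```python
import dask


def auto_range(df, x, y):
    """Compute (x_min, x_max), (y_min, y_max) without materializing the columns.

    `df` is a dask.dataframe.DataFrame; `x` and `y` are column names.
    dask.compute evaluates all four reductions on the workers in one pass,
    returning only scalar results to the client.
    """
    x_min, x_max, y_min, y_max = dask.compute(
        df[x].min(), df[x].max(), df[y].min(), df[y].max()
    )
    return (x_min, x_max), (y_min, y_max)
```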

This fixes #668 for me. Here is the test case I've been using to diagnose and test this PR. I started a Dask distributed instance on a large workstation and persisted the ~3-billion-point OSM dataset into memory. After this persist, the Dask dashboard reports ~28GB of memory used. Then I set up a VM with 8GB of RAM and connected it as a client of the distributed scheduler. I then performed a cvs.points aggregation without specifying x_range/y_range, so that the autorange logic is invoked.

import dask.dataframe as dd
import datashader as ds

# Assumes a dask.distributed Client already connected to the scheduler
# holding the persisted ~3-billion-point OSM dataset.
osm = dd.read_parquet('/path/to/osm-3billion.parq/').persist()
cvs = ds.Canvas()

# No x_range/y_range given, so the Canvas autorange logic is invoked.
agg = cvs.points(osm, x='x', y='y')

Before these changes, the memory usage of the client would climb steadily until the kernel died. With these changes, the aggregation completes successfully with no noticeable increase in memory usage on the client.

cc @jacobtomlinson

jbednar (Member) commented on Feb 23, 2019

Looks great, thanks!

Merging this pull request may close: Memory errors on distributed dask cluster (#668)