Memory errors on distributed dask cluster #668
Comments
@jbednar is this just the way it works or are we doing something wrong?
From that description, I'm not sure exactly what you mean. In my own work, I've previously used persist.
Persist makes sure that chunks of data are around on some worker. That data is usually not on the main machine. Then I would expect datashader to do some complex groupby aggregation and call compute, something like:

```python
df.groupby([df.x.round(...), df.y.round(...)]).count().compute()
```

When you call compute you get a pandas dataframe on the client. I would expect this to be less than the full 20GB. Sometimes people have calls like this hidden in unexpected places.

If I were to diagnose this I would probably look at the code and see any place where compute/persist was being called. I might also put in a bit of logging information and sleeps so that I could try to narrow down where in the datashader pipeline things are pouring data into the client/jupyter process.
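As a rough illustration of that workflow (a sketch with hypothetical dataset and column names, assuming a running distributed client):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")  # hypothetical scheduler address

# persist keeps the partitions in distributed worker memory;
# only lightweight metadata lives in the client/notebook process
ddf = client.persist(dd.read_parquet("some_large_dataset.parquet"))

# the groupby aggregation runs on the workers; compute() then pulls
# only the small aggregated result back to the client
binned = ddf.groupby([ddf.x.round(2), ddf.y.round(2)]).count().compute()
```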
Groupby aggregations always end up in a single partition. I would expect the output of datashader's process to have 900 * 525 rows, so I'm not sure I understand this.
Fair enough. But as you say, I would imagine this partition to be 900 * 525 rows. Unless the reduction (mean, count, etc.) is performed after the data is gathered into the partition? That could result in the 900 * 525 dataframe having all of the data points in each row.
Groupby aggregations are computed by doing groupby aggregations on the partitions, then merging a few, doing more groupby aggregations on those, and so on in a tree reduction until we get to a final result. There is never much memory in any particular partition (assuming that the number of groups is manageable).

As an example, we accomplish a groupby-mean by doing a groupby-sum and groupby-count on each partition, then doing a groupby-sum on both of those until we get down to one, then dividing the result on the final partition.

However, datashader does different things than dask.dataframe. I'm not as familiar with their algorithms, but I suspect that they do something similar.
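To make the tree-reduction idea concrete, here is a rough sketch of the groupby-mean strategy described above, using plain pandas to stand in for the per-partition steps (all names are illustrative):

```python
import pandas as pd

def partition_stats(df):
    # per-partition groupby: keep only a running sum and count, never raw rows
    g = df.groupby("key")["value"]
    return pd.DataFrame({"sum": g.sum(), "count": g.count()})

def combine(stats_list):
    # merge a handful of partial results with another groupby-sum
    return pd.concat(stats_list).groupby(level=0).sum()

def finalize(stats):
    # only on the final partition divide sum by count to get the mean
    return stats["sum"] / stats["count"]

# e.g. with three partitions p1, p2, p3 (each a small pandas DataFrame):
# result = finalize(combine([partition_stats(p) for p in (p1, p2, p3)]))
```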
Hi @jacobtomlinson, I wanted to let you know that I'm planning to take a look at this, as it's definitely an important use case (and it's something that Datashader+Dask should be able to handle). Unfortunately, it probably won't be until early February that I'll have a compute/storage environment set up to be able to reproduce what you're seeing.
@jonmmease thanks for looking into this! If you want access to our JupyterHub/Dask environment for testing and reproduction then let me know and we can get you an account on there.
Thanks for the offer! Once I start digging in I'll let you know if it looks like that would help nail things down.
In working on another project, I just realized that Datashader's glyph autorange logic for Dask calls numpy's min/max functions directly on the dask data. I stepped into these functions in the debugger and it looks like numpy handles these functions by converting the entire dask array into an in-memory numpy array before computing the min/max. This is something we can improve by writing a custom dask-aware code path.

@jacobtomlinson, in your example above it looks like you're not specifying x_range and y_range, so Datashader has to compute them and hits this code path. As a workaround, you could compute the extents yourself and pass them to the Canvas:

```python
ddf = client.persist(dd.read_parquet('Some 20GB dataset'))

x0 = ddf['x'].min().compute()
x1 = ddf['x'].max().compute()
y0 = ddf['y'].min().compute()
y1 = ddf['y'].max().compute()

cvs = ds.Canvas(900, 525, x_range=(x0, x1), y_range=(y0, y1))
agg = cvs.points(ddf, 'x', 'y', agg=ds.mean('z'))
```
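A further tweak worth noting (a sketch of standard dask usage, not something Datashader does for you): the four extent reductions can be evaluated in a single pass with dask.compute, so the persisted partitions are only walked once:

```python
import dask

# build the four lazy reductions, then evaluate them together so that
# dask can share the underlying partition reads between them
x0, x1, y0, y1 = dask.compute(
    ddf['x'].min(), ddf['x'].max(),
    ddf['y'].min(), ddf['y'].max(),
)
```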
I'll give this a go!
Sadly I'm still getting the same issue.
Thanks for giving it a try. My hunch is that this isn't the only place where this kind of thing is happening.
Yes, this is my feeling too. There will be somewhere (or multiple places) where the distributed array is accidentally pulled together.
One solution to this would be to implement NEP-18 for dask arrays. In that case, numpy operations applied to dask arrays would dispatch to dask's own implementations rather than converting everything to an in-memory numpy array first. The equivalent PR for cupy is here: cupy/cupy#1650
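To illustrate what that dispatch would look like once supported (a sketch only; whether the call stays lazy depends on the dask and numpy versions in use):

```python
import numpy as np
import dask.array as da

# a lazily-evaluated dask array spread across chunks
darr = da.random.random((100_000_000,), chunks=10_000_000)

# Without the NEP-18 __array_function__ protocol, np.nanmin(darr) falls back to
# converting the whole dask array into a single in-memory numpy array.
# With the protocol implemented, the same call dispatches to dask's own
# reduction and nothing is materialised until .compute() is called.
result = np.nanmin(darr)
```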
Description
I have a persisted dask dataframe which is larger than the amount of memory on my notebook server and any of the individual workers. The structure is x, y, z lidar data. When trying to plot with datashader, it seems to attempt to transfer the whole dataframe to the notebook when aggregating before plotting.
This results in 20GB of data being transferred to my notebook (and it gets killed by the OOM killer as I only have 16GB of RAM).
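For reference, a minimal sketch of the call pattern being described (the dataset path, scheduler address, and column names are placeholders); because no x_range/y_range is passed, Datashader derives the extents itself, which is where the transfer to the notebook was observed:

```python
import datashader as ds
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")               # hypothetical address
ddf = client.persist(dd.read_parquet("lidar.parquet"))  # ~20GB of x, y, z points

cvs = ds.Canvas(plot_width=900, plot_height=525)        # no x_range/y_range supplied
agg = cvs.points(ddf, 'x', 'y', agg=ds.mean('z'))       # autoranging pulls data to the client
```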
Your environment
Datashader version: 0.6.8
Dask version: 0.20.0
Distributed version: 1.24.0