Datashader #290
I haven't gotten a chance to convert issue272_case1.py, and probably won't get a chance between now and Thanksgiving, as I have a very busy week ahead. That's a pretty plot, though! From a quick skim of the notes about it:
@jbednar thank you very much for this demonstration. I am also cc'ing @clyne to continue our discussions here. I'd like to share a few thoughts and findings from my first pass:
We'll have further analysis and discussion.
Just wanted to note that the 3km resolution is significant because that is what is used by many in the global numerical weather prediction research community. There are also groups running at ~1km resolution.
Does a 3km or 1km resolution grid fit in memory on your machine? If not, you'd need to use a Dask-backed xarray, either out of core or distributed across multiple machines. Many of the people in Pangeo could help you get set up with an example of doing that, if that's the issue here...
@jbednar, can `datashader.rasterize()` be parallelized?
Yes, if you provide a Dask array with your data; it will use whatever Dask workers are available to work on that array.
Thanks. Are there any examples you can point to? My attempts to daskize the TriMesh thus far result in a runaway process, but I could be doing something stupid. Ignoring chunking for now:
Oh, if it's for trimesh, what's supported is a Dask DataFrame, not a Dask Array (not sure which one is in your code snippet). The matrix of which data-object types are supported for which plot types is shown in a table at https://datashader.org/user_guide/Performance.html#Data-objects . As you can see, many combinations are supported, most of which are difficult to build and test on CI, where multi-node and GPU jobs are painful and expensive. We have not done a good job of showing a tested example of each supported combination; instead, for multi-node and GPU cases there are just some notebooks linked from PRs and the like. It's definitely important to get good multi-node and GPU examples together, but that will take a lot of work that we're not funded for, so it's a hole in what we're currently offering. If just switching to a Dask DataFrame isn't enough to get you going with this example, let me know, and I can see if we can make this specific example use Dask. And if anyone knows of someone willing to volunteer or fund such work, I'd be happy to supervise and guide it!
I'm afraid I'm flailing on this. I've not been able to figure out how to feed `hv.TriMesh` a Dask DataFrame that it likes. I'm hoping to discuss datashader+dask at a talk I'm giving at RMACC next week if I can get this sorted out. Any further guidance would be greatly appreciated. Thanks!
I've asked one of my colleagues to adapt a trimesh example to use a Dask DataFrame for you to have as a guide. Hopefully he can do that soon!
Awesome. Thanks so much!
Ok, here's the result: holoviz/holoviews#4927
Thanks much, Jim. Greatly appreciated. I'll see if we can get it adapted ASAP and report back.
I was able to adapt the code in holoviz/holoviews#4927 to a data set with ~80M triangles.
Right; with Dask there's a first step of "get it working with Dask", and then a separate step of "once it works with Dask, configure a scheduler and set of workers to make good use of the available hardware". To get started with this second step, try https://distributed.dask.org/en/latest/quickstart.html, or if you've already set up the distributed scheduler and configured workers appropriately, please post the working (but not sped up) code that you have so far (and corresponding dataset) and we can see if we can spot something.
That said, I'm surprised that the default configuration isn't automatically using all the available cores on a single-CPU machine. We'll look at that tomorrow using our examples and see if we can spot anything amiss.
I've attached the notebook here. The data can be downloaded from: https://drive.google.com/drive/folders/17uxAJCm81qPA1fuwukzHaz4AnWDK7-OX?usp=sharing Thanks!!
Okay, I had a look at it and discovered that the Dask codepath in Datashader was indeed not being exercised by HoloViews. I've fixed that in PR holoviz/holoviews#4935 and will release a new version of HoloViews with this fix (version 1.14.4) some time today. I've also taken the liberty of cleaning up the notebook: MPAS_Datashader_Trimesh.ipynb.zip. Here's a gif of the rasterization in action:
Outstanding! Thank you much, @philippjfr and @jbednar. With the code restructuring even the serial version is performing much better. I look forward to getting your fix incorporated and will report back later. Is there a way to instrument the rasterizer to time the rendering updates (or at least the first rendering)?
It would be easy to time Datashader when it's used as a standalone library rendering to an xarray object, but I'm not sure there is any easy way to time Datashader as it's used inside HoloViews. If it's ok to include the entire process of rendering and plotting, you can do something like the following. I'd run it a few times to see whether the timing varies, e.g. in case something needed to be imported the first time.
Works great. Thanks!
Speculative branch for exploring Datashader, HoloViews, and GeoViews. Fine to merge to the `datashader` branch, but not to master. So far, includes changes to three files. One snippet from them:

```python
hv.Curve(vz, label="V") * hv.Curve(uz, label="U")
```

Run with `jupyter notebook NCL_leg_1.ipynb`, then Run All. There is a `factor` option for selecting the resolution; higher values make it quicker to run.

The full timings for issue272_case2i_ds.py on my MacBook Pro are:
```
$ python issue272_case2i_ds.py 200
MPL Triangulation takes: 0.0015921592712402344 seconds
SciPy Triangulation takes: 0.002576112747192383 seconds
Datashader-based rendering 1 takes: 1.8307688236236572 seconds
Datashader-based rendering 2 takes: 0.3103179931640625 seconds
Matplotlib-based rendering takes: 0.17391610145568848 seconds

$ python issue272_case2i_ds.py 10
MPL Triangulation takes: 0.7670328617095947 seconds
SciPy Triangulation takes: 0.8238921165466309 seconds
Datashader-based rendering 1 takes: 1.839228868484497 seconds
Datashader-based rendering 2 takes: 0.40596699714660645 seconds
Matplotlib-based rendering takes: 1.5457789897918701 seconds

$ python issue272_case2i_ds.py 5
MPL Triangulation takes: 3.4231629371643066 seconds
SciPy Triangulation takes: 3.4271531105041504 seconds
Datashader-based rendering 1 takes: 1.807750940322876 seconds
Datashader-based rendering 2 takes: 0.47564196586608887 seconds
Matplotlib-based rendering takes: 4.945454120635986 seconds

$ python issue272_case2i_ds.py 1
MPL Triangulation takes: 704.985121011734 seconds
SciPy Triangulation takes: 529.0948584079742 seconds
Datashader-based rendering 1 takes: 9.263015031814575 seconds
Datashader-based rendering 2 takes: 3.3065507411956787 seconds
Matplotlib-based rendering takes: 113.213791847229 seconds
```
Note that in issue272_case2i_ds.py the `np.arange()` calls don't actually cover `lon_max` or `lat_max`, which you can see clearly in the factor=200 run (big whitespace on top and right of the plot). Probably `np.linspace()` is more appropriate here, but even then you have to worry about what happens at the latitude edge; presumably those values should be copied from one edge to the other.

Takeaways