Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask options for Iris processing #2511

Merged
merged 3 commits into from
Apr 25, 2017
Merged

Dask options for Iris processing #2511

merged 3 commits into from
Apr 25, 2017

Conversation

DPeterK
Copy link
Member

@DPeterK DPeterK commented Apr 25, 2017

A much simplified replacement for #2462 and thus also a fix for #2403.

Instead of having a highly complex Iris option class that places an Iris API around the "non-API" of dask.set_options and potentially tying ourselves into a very tight coupling to dask.set_options, this PR just calls dask.set_options from within iris._lazy_data at import time. The call to dask.set_options is conditional on the user having not already specified dask options so that importing Iris does not change dask options that are not the default.

The only dask option that may be set by Iris is setting the get function to the synchronous scheduler dask.async.get_sync. This scheduler only runs on a single thread, which matches to the requirement suggested in #2403. In testing this scheduler was shown to be faster than using the conceptually equivalent dask options {'get': dask.threaded.get, 'pool': ThreadPool(1)}, and simpler to construct/maintain.

Iris is running single-threaded using `dask.async.get_sync`. This default
ensures that running Iris under "normal" conditions will not use up all
available computational resource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could usefully explain what dask does by default, that we don't want.
E.G. "Otherwise, by default, dask.array will use a threaded scheduler and grab all available CPUs."

set_options = 'dask.set_options'
self.patch_set_options = self.patch(set_options)
get_sync = 'dask.async.get_sync'
self.patch_get_sync = self.patch(get_sync)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we need to patch this at all.
It's just a value, defined by dask, we can just use it anyway.
Then instead of assert_called_once_with(get=self.patch_get_sync) we would have
assert_called_once_with(get=dask.async.get_sync)


def test_no_user_options(self):
test_dict = {}
with self.patch(self.context, _globals=test_dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to use "with" with the IrisTest.patch() method.
The point of it is to automatically remove itself after each test, so you avoid lots of extra indentation ...

@pp-mo
Copy link
Member

pp-mo commented Apr 25, 2017

Actually I think that the tests can be much simplified.
As all the tests work the same way, instead of a combination of a setUp and boilerplate code it should be clearer to use a common utility routine which centralises the testing method.
Looking at the current functionality, I suggest these changes would be a simpler way of achieving what the current tests are doing.

@pp-mo
Copy link
Member

pp-mo commented Apr 25, 2017

Thanks @dkillick
Over to you again..

@pp-mo pp-mo merged commit accec28 into SciTools:dask Apr 25, 2017
@DPeterK
Copy link
Member Author

DPeterK commented Apr 25, 2017

Thanks for all your input @pp-mo! 🎉



# Run this at import time to set dask options for Iris.
_iris_dask_defaults()
Copy link
Member

@bjlittle bjlittle Apr 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkillick and @pp-mo I'm assuming that the user can simply override this default behaviour by simply using dask.set_options() themselves in their own script ? ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! That's the idea. You could do something like the following...

from multiprocessing.pool import ThreadPool

import dask
import iris  # This will set default options for dask processing in Iris.

dask.set_options(pool=ThreadPool(5))  # Now the user can apply case-specific options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. As long as the user can override the default.

I love the async default. No surprises or wiggle room for odd behaviour 👍

Also, and most importantly for me, if the user does override the default, then they are taking explicit accountability for their actions ... 💥

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @dkillick and @pp-mo 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I'll make sure examples like the above are included in the user documentation when we come around to writing it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkillick If it helps, add a note in the Dask Documentation project, so that we don't forget 👍

@QuLogic QuLogic added this to the dask milestone Apr 26, 2017
bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017
Dask opts controlled by simple function
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants