dask scheduler num_workers #2403
Fwiw, if a Distributed scheduler is defined then that will be used by default, so the current implementation will work with either the threaded or distributed scheduler, depending on what the user has done. Defining a way to pass arguments into the scheduler is definitely useful. I feel like I would want the default settings to use 100% of available resources and let dask / my CPU worry about the consequences, but maybe that's naive.
I briefly discussed this with @marqh, and he mentioned some of the multi-user systems where Iris is deployed and used. I agree that we should test how dask behaves on these systems before going 'live', but I still believe that if you're running a multi-user system it's your responsibility to ensure that your users' CPU access is managed. Making our code slower by default seems wrong to me, but I appreciate that we may need to be pragmatic...
My feeling about this is that Iris should run single thread/process by default, with users needing to opt in if they wish to use the multiprocess goodness that dask offers. This needn't be the default long-term, but I think it is the most appropriate introductory approach. I have a couple of reasons to back up this assertion:
Of course, once the two concerns above are resolved we can reconsider the default multiprocess behaviour in Iris.
In #2457 @bjlittle suggested the pattern of having

```python
with iris.options.parallel(num_workers=6, scheduler='multiprocessing'):
    iris.load('my_dataset.nc')
```

Or

```python
iris.options.parallel(num_workers=6, scheduler='multiprocessing')
iris.load('my_dataset.nc')
```

We can get allowed values for `iris.options.parallel(scheduler='192.168.0.219:8786')`. In `as_concrete_data`:

```python
def as_concrete_data(array):
    if is_lazy_data(array):
        num_workers = iris.options.parallel.get('num_workers')
        scheduler = iris.options.parallel.get('scheduler')
        result = array.compute(num_workers=num_workers, get=scheduler.get)
        ...
```

Alternatively, we could use

Thoughts please people!
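The call-or-context-manager pattern proposed above could be sketched roughly as follows. This is a hypothetical illustration only: the `ParallelOptions` class, its defaults, and its option names are assumptions for the sake of the sketch, not the actual Iris implementation.

```python
class ParallelOptions:
    """Sketch of an options object usable either as a plain call or as a
    context manager. All names and defaults here are illustrative."""

    # Assumed defaults; the real option names/values are Iris's to decide.
    _defaults = {'num_workers': 1, 'scheduler': 'threaded'}

    def __init__(self):
        self._state = dict(self._defaults)
        self._previous = {}

    def __call__(self, **kwargs):
        unknown = set(kwargs) - set(self._defaults)
        if unknown:
            raise ValueError(
                'unknown option(s): {}'.format(', '.join(sorted(unknown))))
        # Remember the prior values so the `with` form can restore them on
        # exit; used as a plain call, the new values simply stay in force.
        self._previous = {key: self._state[key] for key in kwargs}
        self._state.update(kwargs)
        return self

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Restore whatever was in force before the `with` block.
        self._state.update(self._previous)
        return False

    def get(self, name):
        return self._state[name]
```

Used plainly, `parallel(num_workers=6)` changes the setting for the rest of the session; used in a `with` block, the previous values are restored on exit. Note this simple sketch does not handle nested `with` blocks, which a real implementation would need to.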
I think this represents a reasonable approach. I would advocate a '1' being explicitly set somewhere, and I would like to make time to investigate the potential implications of selecting a different, suitably small number, such as 'Three' (it's a magic number ;), as this might provide some neat benefit with limited risk.
I don't know how clear this would be. As part of the documentation for the release, I think we must provide a page on dask. As part of this, we should explore scenarios where I want to reconfigure dask in a certain way. A couple come to mind:
So, all in favour, and keen to contribute to the thought process and implementation.
Unbelievable 🎵
The methodology preferred by dask to set runtime options is
See here for context.
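For reference, dask's own runtime-options mechanism works along these lines. A minimal sketch, assuming a reasonably recent dask: older releases spelled this `dask.set_options(...)`, while current releases use `dask.config.set`, which can be applied globally or scoped with a context manager.

```python
import dask


@dask.delayed
def inc(i):
    # A trivial task, just so the scheduler has something to run.
    return i + 1


tasks = [inc(i) for i in range(8)]

# Scope the scheduler choice and worker count to this block of work;
# outside the `with`, the previous configuration is restored.
with dask.config.set(scheduler='threads', num_workers=4):
    results = dask.compute(*tasks)
# results is the tuple (1, 2, 3, 4, 5, 6, 7, 8)
```

This is the kind of hook an `iris.options.parallel`-style wrapper could delegate to, rather than Iris managing scheduler state itself.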
We need to consider how we easily control the number of threads, processes or workers that the dask scheduler uses - particularly for users targeting a shared resource such as a server or cluster.
Ping @marqh