There is currently a WIP PR open that proposes modifying dask.order, see dask/dask#10535. In dask/dask#9995 and linked issues it has been suggested that wrong ordering is/was a major factor behind the success of root task queuing. With this in mind, I set out to test this hypothesis and looked specifically at troubled workloads that were "fixed" by root task queuing.
So far, I have looked into one specific test case, listed as `test_vorticity` in our benchmarks, which was virtually impossible to run efficiently before task queuing. What is the state of this after dask/dask#10535?
With queuing enabled, nothing actually changes. Both memory usage and runtime stay constant. Good news.
With queuing disabled, the picture is very different.
This is the walltime of the test case, and avg/peak memory usage looks the same. So we still need queuing enabled. However, this massive difference does not align with the theory that ordering and queuing are so strongly related.
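For reference, a minimal sketch of how queuing was toggled for these measurements (an assumption on my part; the `worker-saturation` setting controls root task queuing in recent distributed versions and needs to be set before the cluster is created):

```python
import dask

# Queuing enabled (the default): workers are only sent slightly more root
# tasks than they have threads.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

# Queuing disabled: every runnable root task is sent to a worker immediately.
dask.config.set({"distributed.scheduler.worker-saturation": "inf"})
```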
I ran a slightly different test and got very surprising and interesting results.
Effectively, I disabled the entire root-ish classification logic and performed a couple of measurements.
Note that disabling root-ish logic is quite simple even without modifying code, e.g. `with dask.annotate(restrictions=list(client.scheduler_info()["workers"])):`, since root-ish classification is disabled for tasks with resources or restrictions.
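A minimal sketch of that trick in context (the random-array reduction is a stand-in for the actual benchmark, not `test_vorticity` itself):

```python
import dask
import dask.array as da
from dask.distributed import Client

client = Client()  # assumed: connects to the cluster used for the benchmark

# Restricting tasks to *all* workers does not constrain where they may run,
# but tasks carrying restrictions are never classified as root-ish.
all_workers = list(client.scheduler_info()["workers"])

# Annotations attach at graph-construction time, so build the graph inside
# the context manager.
with dask.annotate(restrictions=all_workers):
    x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))
    result = x.sum()

result.compute()
```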
So, the case without queuing takes roughly twice the memory. However, this added memory usage is not because later reducer tasks are not run in time, but rather because we have a very, very mild root task overproduction. This amount of overproduction is exactly what one would expect given the scheduler->worker latency. Effectively, the worker keeps loading data until the scheduler allows it to run a reducer. Scheduling a reducer requires a network roundtrip, which is slower than scheduling a new task on the thread pool, i.e. we load about twice as much data.
So, what is happening here? I did not only disable task queuing, I disabled root-ish task classification entirely, i.e. I also explicitly bypassed the "is root-ish task but not queued" logic which tries to be smart about placing data (i.e. "co-assignment").
I also subtly changed the way `decide_worker` functions by setting restrictions.
It may be worth investigating this difference further, and whether root-ish classification could be removed again.
Why would we remove root-ish classification again? It is working well, isn't it?
Well, it only works for some cases (#8005) and it is known to slow down certain workloads.
For instance, the benchmarking results of dask/dask#10535 show that some workloads could be up to 50% faster when run without queuing (at the cost of more memory usage).
Task queuing also doesn't work for tasks with resource, worker, or host restrictions, which can be surprising to users relying on it (see the sketch below).
Last but not least, it adds quite a bit of internal complexity.
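To make the restriction point concrete, a sketch of the kinds of submissions that currently bypass queuing (worker address, hostname, and resource name are made up; the resource-restricted task only runs on workers advertising a "GPU" resource):

```python
from dask.distributed import Client

client = Client()  # assumed: an existing cluster

def inc(x):
    return x + 1

# Each of these submissions carries a restriction and is therefore never
# classified as root-ish, so it skips queuing entirely.
f1 = client.submit(inc, 1, workers=["tcp://10.0.0.1:40331"])  # worker restriction
f2 = client.submit(inc, 2, resources={"GPU": 1})              # resource restriction
f3 = client.submit(inc, 3, workers=["node-a.example.com"])    # host restriction
```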