Failing test_climatic_mean
#253
Comments
There are also some changes on Coiled coming up that will help us with memory management (coiled/feedback#185). We now set the memory limits properly both at the OS level (cgroups) and at the dask level (`Worker.memory_limit`), which affects how/when spill-to-disk is triggered, when workers are paused, etc. I'm not implying that this is all there is, but I strongly suspect it will help. We've seen this be a game changer for spill-to-disk-specific integration tests, see #229.
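For reference, the dask-level side of this boils down to the worker memory thresholds plus the per-worker memory limit. A minimal sketch with made-up values (the OS-level cgroup limit is applied separately by the deployment):

```python
import dask
from dask.distributed import LocalCluster

# The thresholds are fractions of Worker.memory_limit; they decide when
# spill-to-disk kicks in, when a worker pauses, and when the nanny kills it.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop running new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})

# memory_limit is the dask-level cap mentioned above; 8GiB is just an example.
cluster = LocalCluster(n_workers=4, memory_limit="8GiB")
```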
I noticed that the runtime "regressions" we've seen lately were not resolved, so I took another look at this test. Thanks to @ian-r-rose I could indeed reproduce this by running the entire module (makes sense, clusters have module-wide lifetime). In fact, it is sufficient to run

Specifically, work stealing is again responsible for this poor scheduling. This is an interesting example where work stealing basically drives root task overproduction by destroying co-assignment. If I disable work stealing, I get mostly as-expected scheduling. I think perfect scheduling would still use much less network transfer, but this is a good start. Note how this task stream shows many more green tasks compared to the task stream of the OP. Green tasks are

Some of the more fundamental problems leading up to work stealing being a problem are described in dask/distributed#6573 (comment) and dask/distributed#6573 (comment).

I still need to verify, but I have a very strong suspicion that this is due to
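(For anyone wanting to reproduce the "stealing off" run above: work stealing is a single scheduler config flag. A sketch, assuming the config is set before the scheduler starts; on a remote cluster the same setting can be passed via the `DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING` environment variable on the scheduler.)

```python
import dask
from dask.distributed import Client

# Turn off work stealing so the scheduler keeps the original (co-assigned)
# task placement instead of rebalancing tasks onto other workers later.
dask.config.set({"distributed.scheduler.work-stealing": False})

client = Client()  # any scheduler started after this picks up the setting
```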
I'm not sure if root task overproduction is related here. This example was one of the success stories of co-assignment (it was basically unrunnable before: "gave up after 30min and 1.25TiB spilled to disk"): dask/distributed#2602 (comment). I'd guess it's just that stealing undoes all the good co-assignment work and reassigns tasks to workers in ways that cause lots of data transfer, driving up memory and slowing down work?
This is a groupby-mean, so transfer is to be expected with the
Well, I think there are a couple of interpretations / causes for root task overproduction. Strictly speaking, I'm talking about "having way more root tasks in memory than should be necessary". Or put differently, what I see is that the ratio between root tasks and reducers is unhealthy, i.e. we have many more root tasks in memory than we should.

I suspect that the driving force behind this "root task overproduction" scenario is actually not overly slow scheduler assignment. The tasks are likely scheduled on the workers sufficiently fast, but since they still need to fetch data, the worker decides to run more stuff first until the dependencies have arrived. I guess this is not the "classic" overproduction we typically talk about, but at the same time I believe it is much more severe. Co-assignment basically allows an incoming high-priority task to be scheduled asap without needing to fetch any data.
Well, groupby-mean is basically a tree reduction, i.e. it should have little network transfer (depending on the cardinality of the output, of course). Doesn't really matter though; I agree that what I'm seeing there is perfectly fine / pretty good.
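(To make the tree-reduction point concrete, here's a rough stand-in for this kind of workload, not the actual benchmark code: a plain chunked mean, where the per-chunk partial results are the root tasks and the combine steps form the reduction tree, so with good co-assignment most combines read local data.)

```python
import dask.array as da

# Root tasks: each chunk of this random array is produced by one task.
x = da.random.random((20_000, 5_000), chunks=(1_000, 5_000))

# Tree reduction: per-chunk partial means are combined level by level;
# split_every controls the fan-in of the tree. Each reducer only needs a
# handful of small intermediates, not the original chunks themselves.
result = x.mean(axis=0, split_every=8)

result.compute()
```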
I did a quick re-run of some of my benchmark scripts from dask/distributed#6560 (comment). I used the latest from dask/distributed#6614 with main merged in, so it has dask/distributed#6944 included.

For each workload I used a separate cluster, and ran all four cases on that same cluster, restarting in between. First I ran without work stealing, queued and un-queued; then I added work stealing and ran queued and un-queued again. Stealing didn't seem to have a significant effect in either of these. (Ignore the varying start times, that's my home internet connection being slow.)

In the anom-mean case, it's clearly queuing that decreases memory use by 8x; turning off stealing doesn't help. In the climatic-mean case, losing co-assignment (due to queuing) makes runtime 2x longer. But without queuing, stealing on vs. off didn't seem to matter.

Since this included dask/distributed#6944, I'm confused why we're seeing different behavior in these tests vs. in the coiled-runtime benchmarks (besides the fact that I'm using different data and cluster sizes, since I'm just re-using my old script). I thought that clearing out the task groups, etc. would clear whatever timing data stealing is using to make its decisions? Are you saying there's state inside of the
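(For reference, the four cases above come down to two settings. A sketch of the 2x2 matrix, assuming the `distributed.scheduler.worker-saturation` knob exposed by the queuing work, where `inf` means no queuing; exact names and defaults may differ between versions.)

```python
import itertools
import dask

# queuing? x stealing? -- the four configurations compared above.
for queuing, stealing in itertools.product([True, False], repeat=2):
    with dask.config.set({
        # A small finite value (e.g. 1.1) enables root-task queuing on the
        # scheduler; float("inf") restores the old un-queued behavior.
        "distributed.scheduler.worker-saturation": 1.1 if queuing else float("inf"),
        "distributed.scheduler.work-stealing": stealing,
    }):
        ...  # start a fresh cluster and run the workload here
```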
FWIW, watching the dashboard for climatic-mean with co-assignment on my cluster, there was a little less red than in your screenshot here.
That's a good point. That's probably part of why co-assignment made such a big difference. It's also probably why not having co-assignment in the queuing PR doesn't hurt as much as we expected: the root task overproduction can't happen while workers are waiting for dependencies. It was always interesting to me that the co-assignment PR reduced the memory usage of some of these workloads a ton. But when we removed co-assignment and added queuing, the memory usage didn't go back up by the same amount; memory stayed about the same, runtime just increased. I think that points to queuing being a more comprehensive solution to root task overproduction than co-assignment and disabling stealing.
Don't know yet. I need to dig a bit more, might've missed something. There are also other parts factoring into stealing decisions, e.g. `total_occupancy` to classify workers as saturated (I noticed that I had a negative `total_occupancy` after running anom_mean / before climatic_mean 🤦), then there is our bandwidth measurement, and yes, there may also be leftover state in the extension as well.
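(In case it helps with digging, here's a rough diagnostic sketch for peeking at that scheduler state between workloads. It assumes the stealing extension is registered under the `"stealing"` key; these are internals, so attribute names may shift between versions.)

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address


def inspect(dask_scheduler):
    stealing = dask_scheduler.extensions.get("stealing")
    return {
        # Negative or stale occupancy skews the saturated/idle classification
        # that stealing decisions rely on.
        "total_occupancy": dask_scheduler.total_occupancy,
        # The measured bandwidth also feeds into stealing cost estimates.
        "bandwidth": dask_scheduler.bandwidth,
        # Any leftover per-task stealing bookkeeping from previous runs.
        "in_flight": len(getattr(stealing, "in_flight", {})) if stealing else None,
    }


print(client.run_on_scheduler(inspect))
```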
Copied from #243 (comment).
`test_climatic_mean` is currently skipped since it fails in every CI run, but this should be investigated and the underlying cause fixed.

When I run this test on its own against coiled-runtime 0.0.4, it does fine. The dashboard looks as it should, tasks go quickly. There's a little bit of spillage, but not much.
However, it's always failing in the full CI job. With a lot of pain, watching GitHub CI logs and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked a bit worse, with more data spilled to disk. Workers kept running out of memory and restarting, so progress was extremely slow and kept rolling back every time a worker died.
Theories for why it's failing:
On distributed==2022.6.0, `MALLOC_TRIM_THRESHOLD_` hasn't been set by default yet. That might make the difference (see the sketch below for one way to set it explicitly). Note, though, that the test passes even without it being set if it's run on a fresh cluster, so that's clearly not the only problem.
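A minimal sketch of setting it explicitly in the meantime, assuming the nanny environment config key (newer distributed releases set a default for this, and in some versions the key is `distributed.nanny.pre-spawn-environ` instead):

```python
import dask
from dask.distributed import LocalCluster

# Have the nanny inject MALLOC_TRIM_THRESHOLD_ into each worker process so
# glibc returns freed memory to the OS more eagerly instead of hoarding it.
dask.config.set({
    "distributed.nanny.environ": {"MALLOC_TRIM_THRESHOLD_": 65536},
})

cluster = LocalCluster(n_workers=4)
```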
Plus, we're `client.restart()`-ing the workers before every test, so the workers should be in the same brand-new state regardless of whether the test is run on its own or after others. However, `client.restart()` doesn't do much to the scheduler, so maybe that's where the problem is. We've known that every subsequent time you submit a workload to the scheduler, it runs slower and slower, and scheduler memory grows and grows: Are reference cycles a performance problem? dask/distributed#4987 (comment). (There's no reason to think things have changed since that research last year.)
As the scheduler gets sluggish, it will be slower both to tell workers about data-consuming downstream tasks to run (instead of the data-producing root tasks they've already been told to run), and to allow them to release keys that are completed and aren't needed anymore. Note that just because a worker runs a downstream task (like writing a chunk to zarr) doesn't mean the worker gets to immediately release the upstream data: it must be explicitly told by the scheduler to do so. If the scheduler is slow, the worker will go load even more data into memory while keeping around the chunks that have already been written to zarr and should have been released.
Thus we see the double whammy of root task overproduction: as soon as the delicate balance of scheduler latency is thrown off, workers will simultaneously produce memory faster than they should and release memory slower than they should.
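(As a concrete stand-in for the pattern described above, not the actual benchmark: data-producing root tasks feed data-consuming zarr writes, and the source chunks stay in worker memory until the scheduler confirms they can be released.)

```python
import dask.array as da

# Data-producing root tasks: each chunk of this array comes from one task.
x = da.random.random((20_000, 5_000), chunks=(1_000, 5_000))

# Data-consuming downstream tasks: one zarr write per chunk. Even after a
# write finishes, the worker keeps the source chunk until the scheduler
# tells it the key can be released. (Requires the zarr package.)
da.to_zarr(x, "scratch.zarr", overwrite=True)
```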
Basically, I think this will only be fixed by "Withhold root tasks [no co assignment]" (dask/distributed#6614), or by understanding and fixing whatever's causing the scheduler to slow down, which is further out ("Are reference cycles a performance problem?", dask/distributed#4987).