Adjust worker batched send interval to cluster size #5036
Conversation
- Both size and CPU based metrics in here. CPU kinda makes more sense to me, but should be weighted avg'd or something.
- CPU test is way too unstable
- Test is a little slow because of scale up
Update: these results are completely invalid and should be ignored; see my next comment.

I'm sad to say that initial tests show this doesn't seem to affect performance. But I'll say that's a maybe, because I'm not yet sure my testing methodology is sound.

To try to pick good values for the new interval, I ran a shuffle workload (imperfectly scaled by cluster size) on Coiled clusters, doing a parameter sweep over cluster size and batched-send interval. The values I tried are here. Because of the increasing-duration-with-reruns problem noted in #4987 (comment), I'm only running each computation once on a cluster, then throwing it away and starting a new one. I have a feeling that's contributing to the variation here. In fact, instance types, etc. could even be changing between runs (since Coiled will over-provision if instances matching your request aren't available).

Also, I'm not sure that linearly rescaling the number of tasks in the workload to cluster size makes sense, since the communication costs of the shuffle are O(n^2) in the size of the cluster, right?

Nonetheless, grouping by cluster size, the batched send interval seems to have very little effect on performance? Or maybe there's too much noise. Full data is here.
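For reference, a rough sketch of what such a sweep could look like. This is not the actual benchmark script; the Coiled/dask calls are my assumptions, and how the batched-send interval gets set per run is left as a placeholder since that mechanism is exactly what this PR adds.

```python
# Hypothetical sketch of the parameter sweep described above: one fresh Coiled
# cluster per measurement, a dataframe shuffle as the workload, wall time recorded.
import time

import coiled
import dask.datasets
from distributed import Client, wait

results = []
for n_workers in (10, 20, 50, 100):
    for interval in ("2ms", "5ms", "25ms", "100ms", "500ms"):
        # Fresh cluster per run, to avoid the slowdown-on-rerun effect noted above.
        with coiled.Cluster(n_workers=n_workers) as cluster, Client(cluster) as client:
            ...  # set the worker batched-send interval here (PR-specific mechanism)
            df = dask.datasets.timeseries()  # in practice, scale the data size with n_workers
            start = time.perf_counter()
            wait(df.shuffle("id").persist())
            results.append((n_workers, interval, time.perf_counter() - start))
```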
Well, the number of edges in the graph should scale with

I would expect a performance improvement to appear at the limit of scheduler CPU saturation. Looking at the numbers in your table, I wouldn't expect the scheduler to actually be saturated for long enough for this to make a big difference. The largest graph you're working with is "only" 120k tasks, and runtimes are in the minute range.

Maybe I'm wrong here, but I would try to push this harder, i.e. one big graph and a medium to large cluster as the setup. On this setup I would do a parameter scan over the send interval.

I'd also be interested in two more things in this area, but I don't want to blow up the scope of your investigation, so feel free to ignore this:
Update: as expected, there was a bug in my benchmarking code. Here are some actual results. I would have much more data, but I accidentally lost the results from a bigger run. I can rerun this and get more data to smooth out the noise if we'd find that helpful.

Lengthening the batched send interval decreases overall runtime, significantly so for larger clusters. Unsurprisingly, it slightly increases runtime for small clusters, which are likely not scheduler-bound, so we should be cautious of this. I don't have enough data (it's too noisy) to confidently say:
I also think this is interesting and haven't decided what to make of it yet: Looking at tasks per second per worker, larger clusters are just... bad. A 10-worker cluster can have 8x higher throughput per worker than a 100-worker cluster. I'm not sure if this measurement is fair though, because I'm still scaling the workload by the number of workers. Maybe it makes sense that the difficulty of a shuffle will scale non-linearly?
Here are histograms of the buffer lengths (number of messages) at every comm send:
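A minimal sketch of one way such buffer-length samples could be collected is below. This is not necessarily the instrumentation used for these plots, and it relies on distributed's internal `Worker.batched_stream` and `BatchedSend.comm` attributes, which may change.

```python
# Hypothetical instrumentation: wrap the write() of the comm owned by each
# worker's BatchedSend, so every actual send records how many messages were
# batched together. Relies on internal attributes (assumption).
def instrument_batched_stream(dask_worker):
    dask_worker._batch_sizes = []
    comm = dask_worker.batched_stream.comm
    original_write = comm.write

    def recording_write(msgs, **kwargs):
        if isinstance(msgs, list):  # BatchedSend flushes its buffer as a list of messages
            dask_worker._batch_sizes.append(len(msgs))
        return original_write(msgs, **kwargs)

    comm.write = recording_write


def collect_batch_sizes(dask_worker):
    return list(getattr(dask_worker, "_batch_sizes", []))


# Usage:
#   client.run(instrument_batched_stream)
#   ... run the workload ...
#   sizes_per_worker = client.run(collect_batch_sizes)  # {address: [batch sizes]}
```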
I've increased the graph sizes across the board; the largest is now 762k tasks. Full results:
These results look great. Even though you can't tell for sure what the exact results will look like, taking the 100-worker cluster as an example, it looks like one can conservatively state that you manage to reduce runtime by at least 10-20%.

I guess this is only as dramatic because shuffle puts a lot of stress on the network. I doubt this will be as dramatic for less network-intensive, but high-task-throughput graphs. You could test this on some non-fused, embarrassingly parallel workload: high throughput (every task transition is still orchestrated by the scheduler) but low network load.

I love the histograms, by the way. Is this some fragile code or something we could contribute? I have to admit these distributions are much wider than I anticipated. What's nice is that even for larger intervals, we're still in the ~10-100KB range for one batch, which I would assume is a nice package size. We can also see how the default configuration gives us barely any batching, since in about 30% of all cases we have a single message.

Do you have any particular tests in mind you'd like to run before we commit to a scaling function, or what would you suggest as next steps? I don't think we'll ever find the perfect setting, just a slightly less bad one :)
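A minimal sketch of the kind of non-fused, embarrassingly parallel workload suggested above, assuming plain futures via `Client.map` so that every task goes through the scheduler but essentially no data moves between workers:

```python
# Hypothetical high-throughput / low-network benchmark: many tiny independent
# tasks submitted as futures, so scheduler traffic dominates.
from distributed import Client, wait


def run_throughput_benchmark(client: Client, n_tasks: int = 200_000) -> None:
    futures = client.map(lambda i: i, range(n_tasks))  # trivially cheap tasks
    wait(futures)
```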
I'm not too surprised, see https://coiled.io/computers-are-not-as-parallel-as-we-think/ ;)
I've collected some more data and am feeling more unsure about this. The data is very noisy. There seems to be more variation within a particular batched send interval than between them. I've gone all the way to a 1,000ms batched send interval hoping to find a clear point where performance drops off.

Full results (note y-axis is different for each subplot):

Maybe in fact the performance drop-off happens quickly, and beyond 100ms or so the difference is irrelevant, so we want to just look at the shorter intervals?

Intuitively, I still feel like the batched send interval is too short, and that it makes sense to scale it dynamically. However, I'm not seeing strong evidence for this in the data.

If we're looking for the inflection point, maybe we should just look at the buffer size histograms. The pattern changes most significantly from 2ms to 25ms, from "very little batching" to "normal-ish distribution of batch sizes centered around 10". Beyond that, it's the same normal distribution, just marched further to the right.

I'm still not sure what number to put here in this PR (target number of messages received by the scheduler per second). Anyone have any ideas, or ideas for how to decide? Should we set this PR aside and maybe just bump the interval to 5ms or 10ms?

A last thing: here are the plots converted into that metric. Full results (note x is log-scale):

I do realize my data doesn't cover this metric's range very well. Here's the distribution I've recorded:

Perhaps it would be worth running benchmarks under this PR, and trying a few values.
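For concreteness, one reading of that metric (an assumption, not necessarily the exact definition used for the plots): every worker flushes one batch per interval, so the scheduler sees roughly n_workers / interval batches per second.

```python
# Rough conversion between batched-send interval and scheduler-side message
# rate (hypothetical reading of the metric above).
def scheduler_messages_per_second(n_workers: int, interval_s: float) -> float:
    return n_workers / interval_s


scheduler_messages_per_second(100, 0.002)  # 2 ms  -> 50_000 batches/s
scheduler_messages_per_second(100, 0.020)  # 20 ms ->  5_000 batches/s
```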
I'm not sure I have a good enough intuition about this. It might just be that our premise is wrong. While I second the intuition that for a large number of workers the interval should be larger, I'm not 100% convinced anymore. The scheduler is still able to accept incoming network traffic to some extent even if it is busy; that's the entire point of using non-blocking IO.

Another error source is the network itself. I would assume that in a busy datacenter, latencies and bandwidth vary over time. We do measure both, but factoring these into this analysis is not trivial. However, it may be necessary to actually scale this interval based on network health instead of cluster size, idk.

Ultimately, do your examples provoke any work starvation? As long as the worker ready queue is full, I assume the communication between scheduler and worker is not a hard bottleneck. After all, the workers will do some work all the time, even though it might not be the optimal order of work. We might observe different memory usage if that's the case, but probably only very minor variations.

I'm not sure how to proceed here. There is likely something underneath this data, but I'm starting to think that it will require an insane amount of data collection and non-trivial error estimation in this huge parameter space to get to the bottom of it. Factoring all of this in is worthy of a PhD thesis, and I'm inclined to put this experiment to rest until we have a more solid foundation for the experiment itself, e.g. a reproducer we are very confident is throttled by scheduler<->worker latency.

Thoughts?
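One quick way to spot-check the work-starvation question is sketched below. This is an assumption-laden helper, not an established diagnostic: `Worker.ready` is an internal attribute and may change.

```python
# Hypothetical spot check: periodically sample the length of each worker's
# ready queue. Consistently empty queues while the cluster is busy would
# suggest workers are starved waiting on the scheduler.
import time

from distributed import Client


def sample_ready_queue_lengths(client: Client, n_samples: int = 20, delay: float = 1.0):
    samples = []
    for _ in range(n_samples):
        samples.append(client.run(lambda dask_worker: len(dask_worker.ready)))
        time.sleep(delay)
    return samples  # list of {worker address: ready-queue length} dicts
```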
Dynamically adjust how often workers send updates to the scheduler based on the size of the cluster.
See #4881 (comment) for context.
Initially, I thought about using the scheduler's current CPU utilization as the scaling metric (rather than purely cluster size), since that's the resource this is actually trying to conserve. However, CPU is quite spiky, making it both hard to test and possibly a poor metric without some more complex code to dampen that spikiness.
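The kind of dampening that would be needed is roughly an exponentially weighted moving average over periodic CPU samples. A minimal sketch, where the psutil-based sampling and the `alpha` value are assumptions and none of this is code from this PR:

```python
# Hypothetical smoothing of a spiky CPU signal via an exponentially weighted
# moving average; not part of this PR.
import psutil


class SmoothedCPU:
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha          # higher alpha reacts faster, smooths less
        self.value = 0.0
        self._proc = psutil.Process()

    def sample(self) -> float:
        cpu = self._proc.cpu_percent()  # CPU % since the previous call
        self.value = self.alpha * cpu + (1 - self.alpha) * self.value
        return self.value
```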
I currently have the interval limited to a 5ms-100ms range, targeting 5k scheduler messages per second. This is roughly based on the fact that a 20ms interval seemed decent for a 100-worker cluster in #4881, and on @mrocklin's comment here #4881 (comment). These numbers are provisional, and I want to do more testing before merging this.
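A minimal sketch of that scaling, as I read it (a simple clamp around a target message rate; not necessarily the exact code in this PR):

```python
# Hypothetical scaling function: choose the per-worker interval so the
# scheduler receives roughly `target` batches per second in aggregate,
# clamped to the 5ms-100ms range mentioned above.
def batched_send_interval(
    n_workers: int,
    target: float = 5000.0,       # scheduler messages per second
    min_interval: float = 0.005,  # 5 ms
    max_interval: float = 0.100,  # 100 ms
) -> float:
    """Seconds between batched sends for each worker."""
    if n_workers <= 0:
        return min_interval
    # Each worker flushes once per interval, so the scheduler sees about
    # n_workers / interval batches per second; invert to hit the target.
    return min(max(n_workers / target, min_interval), max_interval)


batched_send_interval(100)  # -> 0.02 s (20 ms), matching the #4881 observation
batched_send_interval(10)   # -> clamped to 0.005 s (5 ms)
```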
black distributed / flake8 distributed / isort distributed