Communication costs are quadratic in the number of worker threads #394

Open
julian-seward1 opened this issue Jul 7, 2017 · 16 comments

@julian-seward1
Contributor

As a side effect of profiling Stylo for
https://bugzilla.mozilla.org/show_bug.cgi?id=1371496, I picked up some
numbers for Rayon too.

I ran the Stylo benchmark in the abovementioned bug with 1, 4, 16 and 64
worker threads, on Valgrind/Callgrind configured for fine-grain thread
interleaving -- basically all threads move forwards together.

Looking at the costs (in instructions executed, and inclusive of callees)
for rayon_core::registry::WorkerThread::wait_until, I see a cost increase
which looks extreme as the number of threads increases (MI = million insns):

p1 137.1 MI
p4 238.4 MI (1.7 x p1)
p16 3568.7 MI (15.0 x p4)
p64 177089.5 MI (49.6 x p16)

In the worst case (p64), wait_until and callees use 177 G insns, of which
about 129 G are in calls to <coco::deque::Stealer>::steal.

Looking at the code for rayon_core::registry::WorkerThread::steal, it appears that
each thread inspects the work queues of all other threads. So Rayon induces
communication costs in the underlying machine that are at least quadratic in
the number of threads.
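
To make the scan concrete, here is a minimal sketch of the stealing round described above (not Rayon's actual code; it uses crossbeam-deque for the deque API, whereas Rayon used coco at the time). Every idle worker probes every other worker's deque, so with N threads and little work that is N-1 probes per idle thread per round, i.e. O(N^2) cross-thread traffic:

```rust
use crossbeam_deque::{Steal, Stealer};

// One full stealing round: probe every other worker's deque once.
fn steal_from_others<T>(my_index: usize, stealers: &[Stealer<T>]) -> Option<T> {
    for (i, stealer) in stealers.iter().enumerate() {
        if i == my_index {
            continue;
        }
        loop {
            match stealer.steal() {
                Steal::Success(job) => return Some(job),
                Steal::Empty => break,    // nothing here; probe the next worker
                Steal::Retry => continue, // lost a race; retry this deque
            }
        }
    }
    None // no work anywhere this round; the caller spins again or goes to sleep
}
```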

Given that this is measured on a simulator that slows down progress hugely,
I wouldn't pay too much attention to the exact numbers. But it is clear
that Rayon might have scaling problems, especially in cases like this where
(I think) there is actually very little work to do, and so the threads spend
a lot of time searching for work. (But then why don't they end up going to
sleep? Maybe they do. I don't know.)

@nikomatsakis
Member

But then why don't they end up going to sleep? Maybe they do. I don't know

They should eventually... but it depends. Right now, our sleep mechanism is not too smart, in that it wakes up all threads whenever work comes in. It's basically still oriented around a binary "on/off" setup, for the most part, although we have more capability to have only "some" threads active than before. The original intention was that they would all wake up whenever work arrives, but some would go to sleep -- the thresholds for that latter step may however be too high. Moreover, they currently take turns going to sleep, mostly because it helped to simplify the sleeping logic (originally I was trying to have multiple threads go to sleep independently), so this may exacerbate the problem in simulation particularly.

@ghost

ghost commented Jul 7, 2017

Perhaps we also ought to try stealing more than one job from a deque - a common strategy is to steal half of the jobs from another deque. That could reduce the cost of distributing jobs among worker threads quite a bit.

I'll try extending deques to support stealing multiple jobs at once and see how it works out in Rayon. Any thoughts about that?
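
For reference, a sketch of the "steal half" policy, written against crossbeam-deque's batch-stealing API rather than coco's (so purely illustrative of the idea):

```rust
use crossbeam_deque::{Steal, Stealer, Worker};

// Move a batch of jobs from the victim's queue into our own deque and
// return one of them to run immediately; the rest stay queued locally.
fn steal_half<T>(victim: &Stealer<T>, mine: &Worker<T>) -> Option<T> {
    loop {
        match victim.steal_batch_and_pop(mine) {
            Steal::Success(job) => return Some(job),
            Steal::Empty => return None,
            Steal::Retry => continue,
        }
    }
}
```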

@julian-seward1
Contributor Author

@stjepang That sounds like it would help in at least some cases. For the cases where
there are a lot of threads and not much work to do, it doesn't sound so useful, for example
if the work queues mostly only have one item in them anyway.

But worth a try at least from my naive point of view. @nikomatsakis, what do you think?

Even better would be to have an analytical cost model of the Rayon core scheduler so
we could more easily reason about these corner cases without having to hack around the
code to understand its behaviour.

@nikomatsakis
Member

nikomatsakis commented Jul 7, 2017

I'm certainly not opposed to giving it a try. As I mentioned to @julian-seward1 on IRC, though, it doesn't strike me as always a win. Parallel iterators, for example, purposefully lay out their tasks so that the first one to be stolen is roughly 50% of the work (and the next one represents 50% of the remaining work, and so forth). This was part of the overall Cilk design -- when possible, it's generally better I think than creating a ton of fine-grained tasks up-front, since it allows you to parallelize the effort of doing that parallelization. For example, with parallel iterators, as we pop tasks off the deque, if we find they are not being stolen, we switch to a purely sequential execution for bigger and bigger groups of items; since it is often more efficient to process a batch of items in a seq loop, that helps to amortize the total overhead (I guess you could imagine trying to pop off more than one task at a time too, to achieve the same effect?)
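
A minimal sketch of that layout (not Rayon's internal splitter): each level exposes one task holding about half of the remaining range for theft, and when nothing gets stolen the recursion bottoms out into plain sequential loops over whole sub-ranges:

```rust
fn split_work(range: std::ops::Range<usize>, process: &(dyn Fn(usize) + Sync)) {
    // Illustrative threshold; Rayon adapts this based on whether earlier
    // pieces were actually stolen, as described above.
    const SEQ_THRESHOLD: usize = 1024;
    if range.len() <= SEQ_THRESHOLD {
        for i in range {
            process(i); // a sequential batch amortizes the scheduling overhead
        }
    } else {
        let mid = range.start + range.len() / 2;
        // The second closure is the one pushed onto the deque and exposed for
        // stealing, so the first thief walks away with ~50% of the work.
        rayon::join(
            || split_work(range.start..mid, process),
            || split_work(mid..range.end, process),
        );
    }
}
```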

In any case, having the ability to steal (or pop) more than one item at a time would be a great tool in the toolbox, no question.

The other obvious thing to do is to batch up the threads into groups, or a hierarchy, and have them steal from within their group first, before attempting to steal globally. This might eliminate some of the O(n^2) execution.
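
A sketch of what group-first stealing could look like (the group size and layout are purely illustrative; a real grouping would presumably follow cores or NUMA nodes):

```rust
use crossbeam_deque::{Steal, Stealer};

const GROUP_SIZE: usize = 8; // hypothetical; would be tuned to the machine topology

// Probe deques in our own group first, and only fall back to a global scan
// (the O(n^2) part) when the whole group is empty.
fn steal_group_first<T>(me: usize, stealers: &[Stealer<T>]) -> Option<T> {
    let my_group = me / GROUP_SIZE;
    for &local_pass in &[true, false] {
        for (i, s) in stealers.iter().enumerate() {
            if i == me || (i / GROUP_SIZE == my_group) != local_pass {
                continue;
            }
            // Retry is treated as a miss here for brevity.
            if let Steal::Success(job) = s.steal() {
                return Some(job);
            }
        }
    }
    None
}
```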

Something that's unclear to me is how much cost is incurred during the "going to sleep" phase. Ideally, if all tasks are generating parallel opportunities, stealing should be relatively rare. But when there isn't enough work to go around, you will certainly see a lot of cycles spent trying to steal (which doesn't necessarily hurt performance, though it doesn't help; and it certainly doesn't help with power). It seems worth attempting to tune those parameters and see what happens.

Naturally I'm all for analytical modeling, though I'm a bit dubious on its true predictive power. There are so many factors at play.

@cuviper
Member

cuviper commented Jul 7, 2017

It would be nice if we could move you away from the model of spawning lots of discrete jobs. It seems like this is always going to be a worst case of requiring lots of "stealing" from RegistryState::job_injector to get anything done.

Maybe we could apply the idea of stealing a batch just for that injector queue?

I haven't looked at coco's memory allocation, but I'm imagining that if it could just hand us a whole Box<[JobRef]>, then we could split that up further the same way parallel iterators do (not literally with par_iter, since this is in core).
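
Something like the following, where the injector hands an idle worker a whole chunk in one shot that the worker can then re-split locally (JobRef here stands in for Rayon's internal job handle; the Mutex-based queue is just for illustration):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct Injector<Job> {
    queue: Mutex<VecDeque<Job>>,
}

impl<Job> Injector<Job> {
    // Hand back up to `max` injected jobs in a single lock acquisition,
    // instead of one job per steal attempt.
    fn steal_batch(&self, max: usize) -> Box<[Job]> {
        let mut q = self.queue.lock().unwrap();
        let n = max.min(q.len());
        q.drain(..n).collect::<Vec<_>>().into_boxed_slice()
    }
}
```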

@bholley
Contributor

bholley commented Jul 10, 2017

@cuviper I don't really follow your proposal, can you elaborate on it? In general it seems pretty hard to move away from spawn. We don't know all the work up front, and instead discover it as we explore the DOM top-down. Moreover, the processing order matters for the style sharing cache, so we want to be careful in that regard to maintain breadth-first ordering.

The more fundamental issue here is that we seem to have a Thundering Herd problem. The behavior in the Speedometer testcase is not (IIUC) that we saturate 64 threads and then run out of work later. On the contrary, most of the threads presumably spend their entire lifetime trying to steal and never find any work, because we wake up all the threads at once and then burn cycles while they inspect each others' empty queues. It seems like we should wake threads up one at a time, and only wake up thread N when thread N-1 has successfully found a work item to steal.
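
Roughly this shape, where everything here (Sleepers, find_work) is illustrative rather than Rayon internals: incoming work wakes exactly one sleeper, and a worker wakes the next sleeper only once it has actually found something to run:

```rust
use std::sync::{Condvar, Mutex};

struct Sleepers {
    count: Mutex<usize>, // how many workers are currently parked
    cv: Condvar,
}

impl Sleepers {
    fn wake_one(&self) {
        if *self.count.lock().unwrap() > 0 {
            self.cv.notify_one();
        }
    }
}

fn worker_loop(sleepers: &Sleepers, find_work: impl Fn() -> Option<Box<dyn FnOnce()>>) {
    while let Some(job) = find_work() {
        // Thread N found a job, so there may be more: wake thread N+1 now,
        // rather than notify_all when the work was first injected.
        sleepers.wake_one();
        job();
    }
    // Nothing found: park on the condvar (elided) instead of rescanning queues.
}
```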

@cuviper
Member

cuviper commented Jul 10, 2017

OK, what I was thinking about was how to distribute lots of individual work items more efficiently. If there's just not enough work to go around, then yes, I agree it's a problem of managing the idle "herd".

Right now we use Condvar::notify_all, and we could try switching to Condvar::notify_one, with the potential cost that we don't spin up quite as quickly when there is lots of work to be done. That needs to be measured.

@bholley
Contributor

bholley commented Jul 10, 2017

Could we perhaps measure the size of the queue we just stole from? If it's less than the total number of threads we'd wake up with notify_all, we do notify_one.
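
In other words, something along these lines (the remaining count and the condvar are illustrative; Rayon's sleep module isn't actually structured this way):

```rust
use std::sync::Condvar;

// After a successful steal, decide how many sleepers to wake based on how
// many jobs the victim still has queued.
fn wake_after_steal(remaining: usize, num_threads: usize, sleepers: &Condvar) {
    if remaining >= num_threads {
        sleepers.notify_all(); // plenty to go around: spin the pool up fast
    } else if remaining > 0 {
        sleepers.notify_one(); // a little left: invite just one more worker
    }
    // remaining == 0: nobody else would find anything, so let them sleep.
}
```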

@nikomatsakis
Member

I think waking up fewer threads is good, but I'm not sure that notify_one is the answer -- I have to review the logic, I guess. I think that in some cases (e.g., after a latch is signaled), we rely on the fact that we wake up everyone -- if we just woke up one thread, it might not be the one that was waiting on the latch. We could treat "new jobs" available (which can be handled by anyone) differently from "latch signaled", for sure, but it seems like if we are waking up all threads every time a latch is signaled, we're still going to have a hungry herd. We'd need to do better than that I think.
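
For example, the two wake-up reasons could be separated along these lines (the enum and Sleepers type are illustrative, not Rayon's actual sleep module):

```rust
use std::sync::{Condvar, Mutex};

enum WakeReason {
    NewJobs,  // injected work: waking a single thief is enough
    LatchSet, // a latch was signaled: the specific waiter has to see it
}

struct Sleepers {
    guard: Mutex<()>,
    cv: Condvar,
}

impl Sleepers {
    fn tickle(&self, reason: WakeReason) {
        let _held = self.guard.lock().unwrap();
        match reason {
            // Broadcast only when we cannot know which sleeper needs the event.
            WakeReason::LatchSet => self.cv.notify_all(),
            WakeReason::NewJobs => self.cv.notify_one(),
        }
    }
}
```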

I suspect that what we really want in any case is a kind of "quick ramp-up" -- i.e., start a few threads at first, but escalate to the full herd once we've seen some evidence that there will be a lot of items to come.

It'd be worth studying more closely what other threadpools are doing in this respect. From discussions I've had with other implementors, there are a lot of trade-offs involved, and everybody has to ultimately make some choice somewhere.

@julian-seward1
Contributor Author

I think waking up fewer threads is good [..] we rely on the fact that we wake up everyone [..]

This seems to me like we're discussing policy and implementation together.
Can we first discuss policy (what dynamic behaviour we want) without
discussing implementation? Presumably we can implement pretty much any
policy we want once we've decided on it.

It seems to me that there are four things we need to characterise, in terms of
"active" (non-sleeping) threads:

  • the ramp-up profile when there is "enough work". That is, should we increase the
    number of active threads exponentially? Linearly? Some other way?

  • the ramp-down profile when there is "too little work". Same question (linear decline? etc)

  • definition of "enough work". Should that be based on queue sizes, or recent utilisation
    levels of active threads, or something else?

  • definition of "too little work". Ditto. Presumably we'd want to keep the thresholds
    reasonably far apart in order to avoid flipping between states.

In the case where there is little work to do, it is also important to avoid scattering small
bits of work between a large number of mostly idle threads. In unrelated work on Valgrind,
we have seen large (2 x) slowdowns caused by poor interactions with cpu frequency
scaling in that scenario. This means we'd need to have a priority scheme for thread
activation.

From discussions I've had with other implementors, there are a lot of trade-offs involved

Can you summarise them?

@nikomatsakis
Member

@julian-seward1

This seems to me like we're discussing policy and implementation together.

Indeed. It's a good idea to separate them.

It seems to me that there are four things we need to characterise, in terms of
"active" (non-sleeping) threads:

Those are good questions, and I don't know the answers. It seems to me that the best thing would be to try to read into other code-bases and see if we can get a kind of survey of the "menu" that others have attempted.

Some code-bases that seem worth investigating (all of which I believe are open source, for some definition thereof):

We can also try getting in touch with the authors and see if they have any tips to share with us. =)

Can you summarise them?

Not in much depth -- these were relatively casual conversations. The basic tradeoff is pretty clear: the slower you ramp up, the more you increase latency, and you have very few signals as to what the program is ultimately going to do.

@nikomatsakis
Member

cc @Amanieu -- btw, based on our conversation at ECOOP, I thought you might be interested in tracking this issue and/or have useful thoughts. =)

@Amanieu
Contributor

Amanieu commented Jul 16, 2017

In Async++ I keep a list of threads that are sleeping.

Whenever a new task is added to the pool, it will check if there is a sleeping thread and wake one up. This ensures that there is always a thread available to perform work.

There are two places where a thread must do work while waiting: at the root of a worker task, and while joining. Both of these are handled by the same function. It basically performs the following in a loop:

  1. If the join task is done (represented by some sort of atomic flag), then return.
  2. Try to get a task from the queue of the current thread. If we got one, execute it then loop back to 1.
  3. Go through the queues of every other thread, trying to steal a task from them. If we got one, execute it then loop back to 1.
  4. Try to get a task from the public queue (for tasks that are sent from outside the thread pool). If we got one, execute it then loop back to 1.
  5. At this point, no tasks are available to run. Each thread has an event object which it can sleep on and which other threads can signal. [For non-root only] We atomically register the event object with the join task, so that either the task has already been completed (in which case we can just return), or the event object will be signaled by the thread that completes the task.
  6. We add our event object to the list of sleeping threads, and wait for it to be signaled.
  7. After we have been signaled, remove the current thread from the list of waiters, and loop back to 1.

This model is pretty simplistic, but it allows the number of threads to quickly ramp up when new tasks are added.
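
A Rust-flavoured sketch of that loop (Worker, Task, and the stubbed-out methods are stand-ins, not Async++ or Rayon types; the point is the ordering of steps 1-7):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct Task;
impl Task {
    fn run(self) {}
}

struct Worker;
impl Worker {
    fn pop_local(&self) -> Option<Task> { None }               // step 2
    fn steal_from_others(&self) -> Option<Task> { None }       // step 3
    fn pop_public(&self) -> Option<Task> { None }              // step 4
    fn register_sleeper_and_wait(&self, _done: &AtomicBool) {} // steps 5-7
}

fn wait_until_done(worker: &Worker, done: &AtomicBool) {
    loop {
        if done.load(Ordering::Acquire) {
            return; // step 1: the task we are joining on has completed
        }
        if let Some(task) = worker
            .pop_local()
            .or_else(|| worker.steal_from_others())
            .or_else(|| worker.pop_public())
        {
            task.run();
            continue;
        }
        // No runnable work anywhere: register our event object and sleep
        // until another thread signals it, then loop back to step 1.
        worker.register_sleeper_and_wait(done);
    }
}
```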

@coder543

In one workload I'm benchmarking, there's only enough work to keep several threads occupied, but Rayon will completely consume all 16 threads on my processor if given the chance. To be specific, I'm looking at the "zapper_par" benchmark from my upcoming (soon to be announced) library. Setting RAYON_NUM_THREADS=4 only reduces the total throughput by 10% versus using all 16 threads. Interestingly, the performance is 17.2% better at 8 threads than it is at 16... which makes me think it's related to the overhead of communication costs becoming significant when there's no work available for those other threads to do.

Is there a plan to work on this issue? I think Rayon is a great library, I just wanted to report an experience I was having!

@mbyio

mbyio commented Feb 16, 2019

Recently, I was trying to unstable_sort a 50 GiB dataset in memory on a 48-core machine, and I might have run into this issue. Instead of remaining at 70-100% CPU usage, the machine seemed to bounce between 20% and 80%. The next time it comes up I will try to record more precise data.

@joshtriplett

I'm wondering if this might relate to the performance limitations of parallel rustc compiles on large (e.g. 72-way) systems. Profiles suggest that on such systems rustc spends a huge amount of time doing synchronization/stealing.
