
Avoid comm/compute overlap for short-lived tasks #12964

Merged: 3 commits into chapel-lang:master on May 17, 2019

Conversation

ronawho (Contributor) commented May 6, 2019

Avoid comm/compute overlap for short-lived (minimal comm) tasks.
Comm/compute overlap can artificially increase the lifetime of tasks
because we yield, put a task at the end of the queue, and will have to
cycle through all currently running/queued tasks before getting back to
it. This can lead to OOM situations when you have a lot of short-lived
tasks with no user-induced yielding. This can occur pretty easily today
because things like reductions will create numLocales tasks on locale 0.
These tasks are short-lived and don't have any user-induced yields, so
the OOM is pretty extreme.

Here we eliminate comm/compute overlap for unregistered puts/gets and
fast-ons in gasnet. Under ugni, we avoid doing comm/compute overlap for
FMA (short gets/puts and AMOs) unless a task has issued at least 100 FMA
operations. This allows us to get comm/compute overlap in the cases where
it really matters for performance (many FMA requests and large BTE comm)
without increasing the lifetime of short-lived tasks. We use
task-local storage to track the number of FMA requests issued. This is
fast (~half the cost of an atomic operation), so the overhead is
negligible and we still get good performance for oversubscribed
RA-atomics.
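
As a rough sketch of the heuristic (not the actual ugni comm layer code; the struct, helper, and threshold names below are invented, though the real runtime does expose task-local data via `chpl_task_getPrvData()`):

```c
/* Hypothetical sketch of the FMA-counting heuristic. The names
 * comm_task_prv_t, get_task_prv(), and FMA_YIELD_THRESHOLD are
 * invented for illustration; they are not the real ugni code. */
#include <stdint.h>

#define FMA_YIELD_THRESHOLD 100

typedef struct {
  uint64_t num_fma_issued;   /* comm ops issued by this task so far */
} comm_task_prv_t;

/* assume this returns the current task's private data, e.g. built on
 * chpl_task_getPrvData() in the real runtime */
extern comm_task_prv_t* get_task_prv(void);
extern void chpl_task_yield(void);

static void wait_for_fma_completion(volatile int* done) {
  comm_task_prv_t* prv = get_task_prv();
  prv->num_fma_issued++;
  while (!*done) {
    if (prv->num_fma_issued >= FMA_YIELD_THRESHOLD) {
      chpl_task_yield();   /* comm-heavy task: yield for overlap */
    }
    /* else: spin without yielding so short-lived tasks complete
     * quickly and their stacks are reclaimed promptly */
  }
}
```

Reading the task-local counter is cheap (per the numbers above, about half the cost of an atomic), which is why the check can sit on the comm fast path.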

This isn't a perfect solution (it primarily avoids task yields for
endcount manipulation), but I think it's better than what we had
before. A better long-term solution is probably to decrease the size of
our task stacks, and/or to avoid registering task stacks and keep them
on non-hugepages so we can limit the amount of physical memory used for
them.

This should allow us to remove all our execenvs that reduce the task
stack size for multi-locale and scalability testing.

Closes #12874

gbtitus (Member) commented May 7, 2019

As we've discussed previously, this still seems like an awfully indirect way to do what we want, which is basically to throttle task creation as a side effect of yields. I'd rather we do this by adding a throttle to Qthreads. But I'm resigned to my fate here.

That said, is there a strong reason not to do this directly in chpl_task_yield()? We'd need to add an argument saying that we were yielding to wait for communication to complete. But then the only difference, it seems to me, is that instead of not yielding only when waiting for small transfers after the 100th, we'd not yield when waiting for any transfer (including BTE) after the 100th. That doesn't seem a big difference, and I think it's possible we could re-code the Qthreads chpl_task_yield() so that this was an effectively cost-free check. (We already do a qthread_shep() call there, but if we had to do a chpl_task_getPrvData() to support this change we could probably figure out from the result of that whether we had a shepherd or not. And I'd guess those 2 calls probably take about the same amount of time.)

ronawho (Contributor, Author) commented May 7, 2019

> As we've discussed previously, this still seems like an awfully indirect way to do what we want, which is basically to throttle task creation as a side effect of yields. I'd rather we do this by adding a throttle to Qthreads. But I'm resigned to my fate here.

I agree this is weird, but I think it's pretty simple for now and puts us in a better state. I'll open up a new issue for a better long-term fix.

For a Qthreads throttle, I'm not quite sure what that should look like just yet. For the case of oversubscribed ra-atomics we do want to task yield so we can get overlap, and it's possible that not all tasks will be running, so a yield_to_already_running_task() sort of thing isn't quite what we want. We have some design and thinking to do here.

> That said, is there a strong reason not to do this directly in chpl_task_yield()? We'd need to add an argument saying that we were yielding to wait for communication to complete. But then the only difference, it seems to me, is that instead of not yielding only when waiting for small transfers after the 100th, we'd not yield when waiting for any transfer (including BTE) after the 100th. That doesn't seem a big difference, and I think it's possible we could re-code the Qthreads chpl_task_yield() so that this was an effectively cost-free check. (We already do a qthread_shep() call there, but if we had to do a chpl_task_getPrvData() to support this change we could probably figure out from the result of that whether we had a shepherd or not. And I'd guess those 2 calls probably take about the same amount of time.)

Hmm, yeah, that's an interesting/good idea. Let me think about that.
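
For concreteness, a sketch of what that suggested yield variant might look like (purely illustrative: `chpl_task_yield_ex()`, the reason enum, and the helper are all invented names, and the real `chpl_task_yield()` takes no arguments):

```c
/* Hypothetical reason-aware variant of chpl_task_yield(); none of
 * these names exist in the real runtime. */
#include <stdbool.h>

typedef enum {
  chpl_yield_generic,        /* plain cooperative yield */
  chpl_yield_wait_for_comm   /* yielding while waiting on a transfer */
} chpl_task_yield_reason_t;

/* hypothetical helper: consults task-local comm counters, e.g. kept
 * via chpl_task_getPrvData(), and reports whether this task has
 * issued enough comm operations to be worth overlapping */
extern bool task_has_issued_enough_comm(void);
extern void qthreads_do_yield(void);  /* stand-in for the normal path */

void chpl_task_yield_ex(chpl_task_yield_reason_t reason) {
  if (reason == chpl_yield_wait_for_comm &&
      !task_has_issued_enough_comm()) {
    return;  /* short-lived task: skip the yield entirely */
  }
  qthreads_do_yield();  /* fall through to the normal yield path */
}
```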

gbtitus (Member) left a comment:

👍

ronawho (Contributor, Author) commented May 16, 2019

Moving this logic to a yield wrapper is captured in https://github.com/Cray/chapel-private/issues/209. Offline we agreed to go with this to start and make improvements later.

The three commits:

1. Add a test that creates thousands of remote tasks (more task stacks
   than there is physical memory for if they all actually ran
   concurrently). I believe this is a test we want to work under
   qthreads since there are no user-induced task yields. It currently
   fails under both ugni and gasnet because of task yielding in the comm
   layers.

2. Unregistered puts/gets and fast-ons in gasnet should be relatively
   quick (and are currently used for end count manipulation, which can
   artificially increase the lifetime of tasks), so avoid yielding for
   them.

3. Avoid comm/compute overlap for FMA in ugni for tasks that haven't
   issued at least 100 FMA operations. This is a little bit odd, but it
   prevents us from doing comm/compute overlap for tasks that haven't
   done much comm. This allows us to run an arbitrary number of tasks
   with no user-induced yields, but still get comm/compute overlap for
   RA. Here we use task-local storage to track the amount of comm.
   Accessing task-local storage is pretty cheap (~half the cost of a
   processor atomic), so this adds virtually no overhead.
ronawho force-pushed the reduce-comm-compute-overlap branch from 13ed38b to f504fc6 on May 17, 2019 04:31
ronawho merged commit 742cc6b into chapel-lang:master on May 17, 2019
ronawho deleted the reduce-comm-compute-overlap branch on May 17, 2019 06:24
ronawho added a commit that referenced this pull request Mar 16, 2020
Avoid more comm/compute overlap in ugni

[reviewed by @gbtitus]

In #12964 we eliminated comm/compute overlap for FMA operations when a
task hadn't done ~100 FMA operations. Here we extend that to RDMA and
chained FMA operations as well. This is motivated by aggregation work
where we do on-stmts followed by large RDMA PUTs/GETs, but we really
don't want too many tasks started at once since that increases memory
pressure. This also adds an option to completely disable comm/compute
overlap in ugni, but it's not enabled by default since it has a 2x
performance hit for SSCA.
mppf added a commit that referenced this pull request Jan 6, 2021
Adjust remote cache to better handle task yields in comm events

Before this PR, the `--cache-remote` cache used a "lock" to ensure
that only a single task enters the cache API functions. This PR adjusts
it to instead "lock" at the cache entry level. (Here "lock" means to
check a variable indicating the entry is in use, to loop and yield until
it is 0, and then to set it. This is similar to how a lock works, but in
this case it only manages access from other tasks as a result of
`chpl_task_yield`.)
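
As a rough illustration of that entry-level "lock" (not the actual cache code: `current_task_id()` and the function names are invented for the sketch, though `entryReservedByTask` is the field name this PR adds):

```c
/* Minimal sketch of the yield-based entry "lock" described above. As
 * the PR notes, the variable only manages access from other tasks that
 * interleave at chpl_task_yield() points, so in that setting a plain
 * field check-and-set with no yield in between is sufficient; a true
 * cross-thread lock would need atomics instead. */
#include <stdint.h>

extern void chpl_task_yield(void);
extern uint64_t current_task_id(void);   /* hypothetical helper */

struct cache_entry {
  uint64_t entryReservedByTask;  /* 0 when free, else the holder's id */
  /* ... page data, dirty bits, etc. ... */
};

static void reserve_entry(struct cache_entry* e) {
  while (e->entryReservedByTask != 0)
    chpl_task_yield();           /* wait until the holder releases it */
  e->entryReservedByTask = current_task_id();  /* no yield since check */
}

static void release_entry(struct cache_entry* e) {
  e->entryReservedByTask = 0;
}
```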

Details:
* Added the `entryReservedByTask` field to the cache entries. Adjusted
  `flush_entry`, `cache_get`, and `cache_put` to take this "lock".
* Passed `task_local` to many functions since it is used to know whether
  the current task already holds the "lock" on an entry. This is
  primarily used in assertion checking right now.
* Factored per-page operations out of put/get/invalidate
* Added "lock" code around per-page operations -- put/get/invalidate as
  well as evictions take the lock and release it when completed
* Adjusted the code waiting for nonblocking comm operations
  (`do_wait_for`) to better tolerate task yields
* Took care when locking another entry while an entry is already locked,
  since that kind of thing can cause deadlock. The place it was present
  was in the `cache_get` code: it was making sure that there was a freed
  page while waiting for the nonblocking GET. Since this is a
  performance hint, I adjusted it to just give up if the page needing to
  be evicted is locked (see the try-reserve sketch after this list).
* Automatic readahead when doing GETs does not lock multiple entries
  because it only happens on a cache hit - so it unlocks the entry with
  the data before starting the prefetching (which will lock each page
  involved, one at a time).
* Added `EXTRA_YIELD` setting for debugging which does yielding around
  comms events. This allowed me to debug the issues with yielding using a
  local gasnet configuration.
* Added more debug printing/tracing that I used during debugging.
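
Continuing the same invented sketch from above, the deadlock-avoidance bullet amounts to a non-blocking try-reserve that simply gives up when the victim entry is held:

```c
/* Non-blocking variant: never wait while already holding another
 * entry, which is how lock-ordering deadlock is avoided in the sketch.
 * Uses the invented cache_entry/reserve helpers from the sketch above. */
#include <stdbool.h>

static bool try_reserve_entry(struct cache_entry* e) {
  if (e->entryReservedByTask != 0)
    return false;                /* someone holds it; caller gives up */
  e->entryReservedByTask = current_task_id();
  return true;
}

/* During a GET, pre-freeing a page is only a performance hint, so skip
 * it rather than risk deadlock while this task already holds an entry: */
static void maybe_evict_for_space(struct cache_entry* victim) {
  if (try_reserve_entry(victim)) {
    /* flush_entry(victim); ... */
    release_entry(victim);
  }
  /* else: give up; the GET proceeds without a pre-freed page */
}
```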

Additionally, I noticed that the strided get/put calls were invalidating
the entire cache but now we have functions to invalidate just a region.
At the same time, we have strided "common helper" functions that
enumerate the contiguous regions in
runtime/include/chpl-comm-strd-xfer.h. So, I updated the strided get/put
support in the cache to only invalidate the portions updated by the
operation, instead of the entire cache. To do so, I needed to add another
variant of the code in chpl-comm-strd-xfer.h since the other functions
there call `comm_get` / `comm_put` and this one needs to invalidate. We
can change the setting `STRIDED_INVALIDATE_ALL` to check the performance
of this change independently.
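
As a hedged sketch of that idea, here is a one-dimensional version of walking a strided transfer as contiguous spans and invalidating only the bytes each span touches; `cache_invalidate_region()` is an invented stand-in for the cache's region-invalidation entry point, and the real helpers in runtime/include/chpl-comm-strd-xfer.h handle arbitrary dimension counts:

```c
/* Sketch: treat a 1-D strided transfer as nblocks contiguous blocks of
 * blk_bytes each, stride_bytes apart, and invalidate only those spans
 * instead of the entire cache. */
#include <stddef.h>

extern void cache_invalidate_region(void* addr, size_t len);  /* hypothetical */

static void invalidate_strided_1d(void* base, size_t blk_bytes,
                                  size_t stride_bytes, size_t nblocks) {
  char* p = (char*)base;
  for (size_t i = 0; i < nblocks; i++) {
    cache_invalidate_region(p, blk_bytes);  /* only the touched span */
    p += stride_bytes;
  }
}
```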

Related to PRs #15266, #16020 and issues
Cray/chapel-private#585 and
Cray/chapel-private#879. PR #12964 talks a bit
about why comm-compute overlap changes are important in comm performance.
This PR improves performance for several Arkouda operations when running
with `--cache-remote`. The reason it improves performance is that the
previous coarse-grained locking strategy added yields without really
making progress on communication in oversubscribed settings: tasks would
be created to try to do comms but would not be able to make progress.

Reviewed by @gbtitus - thanks!

- [x] test/runtime/configMatters/comm/cache-remote and
  test/optimizations/cache-remote pass with `VERIFY` and `EXTRA_YIELDS`
  enabled (except for `ra_prefetch.chpl`, which times out because the
  problem size is too large for this configuration, but works with a
  smaller problem size)
- [x] full local gasnet testing
- [x] full local gasnet testing with `--cache-remote`
Successfully merging this pull request may close these issues:

* Comm/compute overlap increases task lifetime under ugni